In this paper, we consider a queue-aware distributive resource control algorithm for two-hop MIMO
cooperative systems. We shall illustrate that relay buffering is an effective way to reduce the intrinsic
half-duplex penalty in cooperative systems. The complex interactions of the queues at the source node
and the relays are modeled as an average-cost infinite horizon Markov Decision Process (MDP). The
traditional approach to solving this MDP problem involves centralized control with huge complexity. To
obtain a distributive and low complexity solution, we introduce a linear structure which approximates the
value function of the associated Bellman equation by the sum of per-node value functions. We derive a
distributive two-stage two-winner auction-based control policy which is a function of the local CSI and
local QSI only. Furthermore, to estimate the best fit approximation parameter, we propose a distributive
online stochastic learning algorithm using stochastic approximation theory. Finally, we establish technical
conditions for almost-sure convergence and show that under heavy traffic, the proposed low complexity
distributive control is globally optimal.



I. Introduction 

Cooperative relay communication has been a hot research topic in both academia [1], [2] and
industry [3], [4] because it can exploit the broadcast nature of wireless communication to achieve
cooperative diversity. One potential issue of cooperative communication is the half-duplex penalty at the
relay nodes. There have been some recent works addressing the half-duplex issue in cooperative relay
systems. For example, a complex echo cancelation technique is used at the relay to cancel the coupled
interference from the transmitting path [5], [6]. However, these works all focus on physical layer
signal processing. In [7], the authors exploit a special topology and propose some relay protocols to get
rid of the half-duplex penalty. However, this approach depends heavily on the locations of the relays and
cannot be extended to the general relay channel. In this paper, we are interested in exploring a system level
solution to deal with the half-duplex issue. We consider a simple MIMO cooperative relay system with a
multi-antenna source node (Src), M multi-antenna relay nodes (RS) and a multi-antenna destination node
(Dst). We shall illustrate that relay buffering can be utilized to significantly reduce the intrinsic half-duplex
penalty. Since buffering is involved, it is important to consider not only the throughput performance but
also the associated end-to-end delay performance. As a result, we shall focus on delay-optimal resource
control for the two-hop protocol in MIMO cooperative relay systems.

Delay-optimal resource control in cooperative relay systems is a very difficult problem. Most of the
existing works have assumed infinite backlogs of information and focus on optimizing the throughput
performance only. A systematic approach is to model the delay-optimal control as a Markov Decision
Process (MDP) [8], [9]. However, there is the well-known curse of dimensionality, and brute-force
value iteration or policy iteration cannot give simple implementable solutions. For multi-hop
systems, there is a unique challenge concerning the complex interactions of buffers at the source node
and the M RS nodes, and the existing solutions for single-hop systems cannot be extended easily to deal
with this situation. There are a few recent works that considered queue dynamics in relay systems [10],
[11]. However, these works have focused on the characterization of the stability region and throughput
optimal control. The question of delay-optimal control for cooperative relay systems remains open.
In addition, another important technical challenge is the distributive implementation consideration. For
instance, the entire system state could be characterized by the global CSI (CSI among every pair of nodes
in the system) as well as the global QSI (QSI of every buffer in the system). A brute-force solution of the
MDP will yield a control policy that is adaptive to the global CSI and global QSI. This poses a huge
implementation challenge because this global system state information is distributed locally at each of
the source and relay nodes.

In this paper, we shall address the above challenges as follows. We shall first formulate the delay-optimal
resource control policy (such as the power control and RS selection) as an average-cost infinite horizon
Markov Decision Process (MDP). To alleviate the curse of dimensionality¹, and to obtain a distributive and
low complexity solution, we first introduce a per-node value function to approximate the value function of
the associated Bellman equation. Based on the per-node value function, we derive a distributive two-stage
two-winner auction-based control policy, which is a function of the local CSI and local QSI. The per-node
value function is obtained via a distributive online stochastic learning algorithm, which requires local
CSI and local QSI only. The proposed online stochastic learning differs from conventional
reinforcement learning [12] in two main ways: (1) We are dealing with a constrained MDP (CMDP) and our
online iterative solution updates both the value function and the Lagrange multipliers (LM) simultaneously;
(2) The control action is determined from the per-node value functions of all the nodes via a per-slot auction
mechanism. Therefore, the algorithm dynamics of the per-node online learning is not a contraction mapping
and hence, the standard convergence proof using the fixed point theorem cannot be applied directly in our case.
Using the technique of separation of time scales, we establish technical conditions for the almost
sure convergence of the proposed distributive stochastic learning. We also show that the proposed low
complexity distributive solution is asymptotically globally optimal under heavy traffic loading. Finally, we
demonstrate by simulation that the proposed scheme has significant performance gain over various baselines
(such as conventional CSIT-only control and throughput-optimal control (in the stability sense)) with low
complexity $O(M)$ and low signaling overhead.

1 For example, for a system with a maximum buffer length of 20, 3 CSI states and M RSs, the total number of
system states is $20^{M+1} \times 3^{2M}$, which is unmanageable even for a small number of RSs.
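To give a feel for the curse of dimensionality mentioned above, the following minimal sketch counts the states of a centralized MDP. It assumes (our assumption, not a statement of the paper's exact model) that every one of the M+1 queues takes a fixed number of levels and that every RS contributes two quantized links (S-R and R-D), with relay-to-relay links ignored. The function name and parameters are hypothetical.

```python
def num_system_states(buffer_levels, csi_states, num_relays, links_per_relay=2):
    """Count global (QSI, CSI) states under the assumptions stated above.

    buffer_levels   -- number of distinct levels each queue can take
    csi_states      -- number of quantized CSI states per link
    num_relays      -- M, the number of relay stations
    links_per_relay -- links counted per RS (assumed 2: S-R and R-D)
    """
    num_queues = num_relays + 1            # source queue plus M relay queues
    num_links = links_per_relay * num_relays
    return buffer_levels ** num_queues * csi_states ** num_links


if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(m, num_system_states(20, 3, m))
```

Even for M = 2 the count is already 648000 under these assumptions, and it grows exponentially in M, which is why a per-node decomposition is attractive.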



II. System Models

A. System Architecture and MIMO Relay Physical Layer Model

We consider a two-hop multi-antenna cooperative relay communication system with one multi-antenna
source node ($N_T$ antennas), M multi-antenna half-duplex relay stations (RS, each with $N_R$ antennas) and
one multi-antenna destination node ($N_T$ antennas), as illustrated in Fig. 1. The source node cannot deliver
packets directly to the destination node due to limited coverage and the cooperative RSs are deployed to
extend the source node's coverage.

Denote the Rx-RS and the Tx-RS as the m-th RS and the n-th RS for notation simplicity². Let $N_{SR}$ and
$N_{RD}$ be the number of data streams transmitted in the S-R link and the R-D link respectively, where we
require $N_{RD} = \min(N_T, N_R - N_{SR})$ for simultaneous interference-free transmission. We shall illustrate
the signal model of the S-R$_m$ link and the R$_n$-D link as follows:

2 Since the RSs are half-duplex under practical consideration, we require $m \neq n$ implicitly.
Fig. 1. Illustration of the two-hop MIMO cooperative system with a multi-antenna source node, 2 multi-antenna RS nodes and a
multi-antenna destination node. By exploiting buffers at the 2 MIMO RSs, the S-R link (source node to RS1) and R-D link (RS2 to
destination node) can deliver packets simultaneously without interfering with each other using signal processing techniques (with
appropriate precoder and decorrelator designs). Thus, by exploiting relay buffering, we could substantially reduce the intrinsic
penalty associated with half duplex relays.



• S-R$_m$ link: Let $X_S \in \mathbb{C}^{N_{SR} \times 1}$ and $F_S \in \mathbb{C}^{N_T \times N_{SR}}$ be the symbol vector and the precoder matrix of
the source node respectively, and let $G_m \in \mathbb{C}^{N_{SR} \times N_R}$ be the decorrelator matrix at the m-th RS node. The
$N_{SR} \times 1$ post-processing symbol vector at the m-th RS is given by $Y_m = G_m H_{S,m} F_S X_S + Z_{S,m}$,
where $H_{S,m} \in \mathbb{C}^{N_R \times N_T}$ is the zero-mean unit variance i.i.d. complex Gaussian fading matrix from
the source node to the m-th RS and $Z_{S,m} \in \mathbb{C}^{N_{SR} \times 1}$ is the zero-mean unit variance complex Gaussian
channel noise.

• R$_n$-D link: Let $X_n \in \mathbb{C}^{N_{RD} \times 1}$ and $F_n \in \mathbb{C}^{N_R \times N_{RD}}$ be the transmit symbol vector and the precoder
of the n-th RS respectively. The $N_T \times 1$ received symbol vector at the destination node is given by³
$Y_D = H_{n,D} F_n X_n + Z_{n,D}$, where $H_{n,D} \in \mathbb{C}^{N_T \times N_R}$ is the complex Gaussian fading matrix from the
n-th RS to the destination node and $Z_{n,D} \in \mathbb{C}^{N_T \times 1}$ is the complex Gaussian channel noise.

In this paper, the resource control is performed distributively at each RS and therefore, we define the
local channel state information (CSI) available at each RS as follows. For the m-th RS, there are two
types of local CSI, namely the type-I local CSI and the type-II local CSI, as illustrated in Fig. 2. The type-I
and type-II local CSI of the m-th RS are denoted by $\mathbf{H}_m^{I} = \{H_{S,m}\}$ and
$\mathbf{H}_m^{II} = \{H_{m,D}\} \cup \{H_{m,n} | n \neq m, 1 \le n \le M\}$, respectively. For notation convenience, let
$\mathbf{H}_m = \mathbf{H}_m^{I} \cup \mathbf{H}_m^{II}$ be the local CSI⁴ at the
m-th RS and $\mathbf{H} = \cup_{m=1}^{M} \mathbf{H}_m$ be the global CSI (GCSI) of the system. Moreover, the assumption on the
channel is summarized below:

3 Due to the limited coverage of the source node, we assume the received signal from the source node is negligible compared
with the received signal from the relay node.

Assumption 1 (Assumption on Channel Fading): We assume the channel fading elements in the global
CSI $\mathbf{H}$ are i.i.d. $\mathcal{CN}(0, 1)$. The CSI is quasi-static within a frame but i.i.d. between frames.

We assume strong channel coding is used and hence, the maximum achievable data rate is given by the
instantaneous mutual information⁵. If the source node transmits $R_{S,m}$ information bits to the m-th RS in the
current frame, the frame will be successfully received if
$R_{S,m} \le \tau \log_2 \det [I + G_m H_{S,m} F_S F_S^{\dagger} H_{S,m}^{\dagger} G_m^{\dagger}]$,
where $\dagger$ denotes the matrix conjugate transpose and $\tau$ is the frame duration. Similarly, the destination
node can successfully decode a frame with $R_{n,D}$ information bits (transmitted from the n-th RS) if
$R_{n,D} \le \tau \log_2 \det [I + H_{n,D} F_n F_n^{\dagger} H_{n,D}^{\dagger}]$.

B. Buffered Decode and Forward 

Although the RS nodes are half-duplex relays⁶, it is possible to reduce the system half-duplex penalty
by exploiting buffers at the half-duplex RSs. Specifically, the source node can transmit a packet to the
m-th RS (denoted as the Rx-RS) while, at the same time, the n-th RS (denoted as the Tx-RS) transmits
its buffered packet to the destination node without interfering with the Rx-RS. This is possible by means of
precoder-decorrelator designs at the source node, the Rx-RS (m-th RS) and the Tx-RS (n-th RS). Let $p_{S,m}$
and $p_{n,D}$ denote the total transmit power at the source node for the S-R$_m$ link and at the Tx-RS (n-th RS)
for the R$_n$-D link, respectively. For any given $N_{SR}$, $p_{S,m}$ for the S-R$_m$ link as well as $N_{RD}$, $p_{n,D}$ for
the R$_n$-D link (where $N_{RD} = \min(N_T, N_R - N_{SR})$ implicitly), the decorrelator and precoder designs are
elaborated below.

• Precoder and Decorrelator Design of the S-R$_m$ Link at the Rx-RS Node⁷: The precoder at the
source node ($F_S$) and the decorrelator at the Rx-RS node ($G_m$) are chosen to optimize the mutual
information of the S-R$_m$ link subject to the transmit power constraint as follows:

$$\{G_m^*(N_{SR}), F_S^*(N_{SR}, p_{S,m})\} = \arg\max_{F_S, G_m} \log_2 \det \left[ I + G_m(N_{SR}) H_{S,m} F_S F_S^{\dagger} H_{S,m}^{\dagger} G_m^{\dagger}(N_{SR}) \right]$$

$$\text{s.t. } \mathrm{tr}(F_S F_S^{\dagger}) = p_{S,m} \quad \text{(Transmit power constraint)}. \qquad (1)$$

Let $H_{S,m} = U_{S,m} \Sigma_{S,m} V_{S,m}^{\dagger}$ be the SVD of the channel matrix $H_{S,m}$, where the singular
values in $\Sigma_{S,m}$ are sorted in decreasing order along the diagonal, $U_{S,m} = [u_{S,m}^1, \dots, u_{S,m}^{N_R}]$ and
$V_{S,m} = [v_{S,m}^1, \dots, v_{S,m}^{N_T}]$. Using standard optimization techniques [14], the source precoder $F_S^*$ is
given by

$$F_S^*(N_{SR}, p_{S,m}) = [v_{S,m}^1, \dots, v_{S,m}^{N_{SR}}] \times \mathrm{diag}\left\{ \frac{1}{\lambda_{S,m}} - \frac{1}{\eta_{S,m}^1}, \dots, \frac{1}{\lambda_{S,m}} - \frac{1}{\eta_{S,m}^{N_{SR}}} \right\}, \qquad (2)$$

where $\eta_{S,m}^1 \ge \eta_{S,m}^2 \ge \dots \ge \eta_{S,m}^{N_{SR}}$ are the first $N_{SR}$ singular values of the channel matrix $H_{S,m}$ and $\lambda_{S,m}$ is the
Lagrange multiplier corresponding to the transmit power constraint in (1). The decorrelator is
given by

$$G_m^*(N_{SR}) = [u_{S,m}^1, \dots, u_{S,m}^{N_{SR}}]^{\dagger}. \qquad (3)$$

4 Note that both the type-I and type-II local CSI at the m-th RS refer to all the outgoing links from the m-th RS and hence,
they can be measured at the m-th RS using channel reciprocity and preambles. For example, there are standard signaling and
channel sounding mechanisms in the WiMAX (802.16j, 802.16m) and LTE systems for the RS to acquire the local CSI.

5 For example, LDPC with reasonably large block length (e.g. 8 kbyte) can achieve the instantaneous mutual information to within
0.5 dB SNR [13].

6 Half-duplex relay means that the RS nodes do not have any Tx/Rx echo-cancelation capability.

7 Type-I local CSI $\mathbf{H}_m^{I}$ is required at the m-th Rx-RS node to compute the precoder and decorrelator of the S-R$_m$ link.

• Precoder Design of the R$_n$-D Link at the Tx-RS Node⁸: Similarly, given the decorrelator $G_m^*$
in (3), the precoder at the Tx-RS node $F_n \in \mathbb{C}^{N_R \times N_{RD}}$ is selected to maximize the R-D link mutual
information subject to the transmit power constraint and the interference nulling constraint (at the
Rx-RS node) as follows:

$$F_n^*(N_{RD}, p_{n,D}) = \arg\max_{F_n} \log_2 \det \left[ I + H_{n,D} F_n F_n^{\dagger} H_{n,D}^{\dagger} \right]$$

$$\text{s.t. } G_m^*(N_{SR}) H_{n,m} F_n = 0 \quad \text{(Interference nulling constraint)} \qquad (4)$$

$$\mathrm{tr}(F_n F_n^{\dagger}) = p_{n,D} \quad \text{(Transmit power constraint)} \qquad (5)$$

The interference nulling constraint in (4) allows simultaneously active R-D and S-R links using
half-duplex RSs. Let $H_{n,D} \times \mathrm{null}(G_m H_{n,m}) = U_{n,D} \Sigma_{n,D} V_{n,D}^{\dagger}$ be the SVD, where
the singular values in $\Sigma_{n,D}$ are sorted in decreasing order along the diagonal, $\mathrm{null}(G_m H_{n,m})$
denotes the null space of the matrix $G_m H_{n,m}$ and $V_{n,D} = [v_{n,D}^1, \dots, v_{n,D}^{N_R - N_{SR}}]$. Using standard
optimization techniques [14], the precoder at the Tx-RS ($F_n^*$) is given by:



$$F_n^*(N_{RD}, p_{n,D}) = [v_{n,D}^1, \dots, v_{n,D}^{N_{RD}}] \times \mathrm{diag}\left\{ \frac{1}{\lambda_{n,D}} - \frac{1}{\eta_{n,D}^1}, \dots, \frac{1}{\lambda_{n,D}} - \frac{1}{\eta_{n,D}^{N_{RD}}} \right\}, \qquad (6)$$

where $\eta_{n,D}^1 \ge \eta_{n,D}^2 \ge \dots \ge \eta_{n,D}^{N_{RD}}$ are the first $N_{RD}$ singular values of the matrix $H_{n,D} \times \mathrm{null}(G_m H_{n,m})$ and
$\lambda_{n,D}$ is the Lagrange multiplier corresponding to the power constraint in (5).

8 Type-II local CSI is required at the n-th Tx-RS node to compute the precoder of the R$_n$-D link.
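The Lagrange multipliers in the two precoder designs above play the role of a water level that is tuned until the power constraint is met. As a hedged illustration, the sketch below implements the textbook water-filling allocation over parallel streams, taking the effective per-stream power gains (assumed here to be the squared singular values) as input. This is a generic sketch under that assumption, not a transcription of the paper's exact diagonal loading; the function name `water_filling` is hypothetical.

```python
def water_filling(gains, total_power):
    """Textbook water-filling over parallel streams.

    gains       -- effective power gains g_i > 0 of the streams
                   (assumed: squared singular values of the channel)
    total_power -- total transmit power to distribute (> 0)

    Returns powers p_i = max(mu - 1/g_i, 0) with sum(p_i) = total_power,
    where mu = 1/lambda is the water level found by bisection.
    """
    if total_power <= 0 or any(g <= 0 for g in gains):
        raise ValueError("gains and total_power must be positive")
    inv = [1.0 / g for g in gains]
    lo, hi = 0.0, total_power + max(inv)   # the level hi always uses >= total_power
    for _ in range(200):                   # bisection on the water level
        mu = 0.5 * (lo + hi)
        used = sum(max(mu - x, 0.0) for x in inv)
        if used > total_power:
            hi = mu
        else:
            lo = mu
    mu = 0.5 * (lo + hi)
    return [max(mu - x, 0.0) for x in inv]


if __name__ == "__main__":
    # The weakest stream is switched off; the other two share the power.
    print(water_filling([4.0, 1.0, 0.25], 1.0))
```

For gains (4, 1, 0.25) and unit power the water level settles at 1.125, so the powers are approximately (0.875, 0.125, 0): the weakest stream lies above the water level and receives nothing.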

C. Bursty Source Model and Queue Dynamics 

There is one queue in the source node and one queue in each of the M RSs for the storage
of received information bits. Let $N_Q$ be the maximum buffer size (number of bits) for the buffers in the
source node and all the RSs. Let $X(t)$ denote the number of new information bits arriving at the source node
in the t-th frame. The assumption on the bit arrival process is given below:

Assumption 2 (Assumption on Arrival Process): We assume $X(t)$ is i.i.d. over frames based on a
general distribution $f_X(x)$ with $E[X(t)] = \lambda_S$ and the information bits arrive at the end of each frame.

Moreover, let $Q_S(t)$ and $Q_m(t)$ denote the number of information bits in the source node's queue and
the m-th RS's queue ($1 \le m \le M$) at frame t. We assume each RS has the knowledge of its own
queue length and the source node's queue length. Thus, the local QSI of the m-th RS is $(Q_S(t), Q_m(t))$, and
$\mathbf{Q}(t) = (Q_S(t), Q_1(t), \dots, Q_M(t))$ denotes the global queue state information (GQSI) at frame t.

The overall system queue dynamics at the source node and the RSs are summarized below. 

• If the source node successfully delivers $R_{S,m}(t)$ information bits to the m-th RS at frame t, then
$Q_S(t+1) = \min\{\max\{Q_S(t) - R_{S,m}(t), 0\} + X(t), N_Q\}$ and $Q_m(t+1) = \min\{Q_m(t) + R_{S,m}(t), N_Q\}$.

• If the source node fails to deliver any information bit to the RSs, then $Q_S(t+1) = \min\{Q_S(t) + X(t), N_Q\}$.

• If the n-th RS successfully delivers $R_{n,D}(t)$ information bits to the destination at frame t, then
$Q_n(t+1) = \max\{Q_n(t) - R_{n,D}(t), 0\}$.
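The three update rules above can be collected into a single per-frame step. The following is a minimal sketch that applies them literally; the function name `step_queues` and its argument names are hypothetical, and a failed transmission is represented by passing `None` for the corresponding relay index.

```python
def step_queues(q_src, q_relays, arrivals, n_q, rx=None, r_sr=0, tx=None, r_rd=0):
    """Advance the source and relay queues by one frame.

    q_src    -- Q_S(t), bits in the source queue
    q_relays -- [Q_1(t), ..., Q_M(t)], bits in each relay queue
    arrivals -- X(t), new bits arriving at the end of the frame
    n_q      -- N_Q, maximum buffer size
    rx, r_sr -- index of the Rx-RS and bits delivered on the S-R link
                (rx=None means the S-R transmission failed)
    tx, r_rd -- index of the Tx-RS and bits delivered on the R-D link
                (tx=None means no successful R-D transmission)
    """
    if rx is not None and rx == tx:
        raise ValueError("half-duplex relays: Rx-RS and Tx-RS must differ")
    nxt = list(q_relays)
    if rx is not None:
        new_src = min(max(q_src - r_sr, 0) + arrivals, n_q)
        nxt[rx] = min(q_relays[rx] + r_sr, n_q)
    else:
        new_src = min(q_src + arrivals, n_q)
    if tx is not None:
        nxt[tx] = max(q_relays[tx] - r_rd, 0)
    return new_src, nxt


if __name__ == "__main__":
    # Source sends 6 bits to RS 0 while RS 1 forwards 5 bits to the destination.
    print(step_queues(10, [3, 8], arrivals=4, n_q=20, rx=0, r_sr=6, tx=1, r_rd=5))
```

In the usage line the source queue goes from 10 to 8 (6 bits leave, 4 arrive), RS 0 grows from 3 to 9 and RS 1 shrinks from 8 to 3, which shows both links being active in the same frame.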

Remark 1: Each information bit delivered from the source node will be received by one of the RSs 
and different RSs may have different information bits in the buffer. When the source node is to deliver 
information bits to one RS, selecting different RSs with different buffer lengths may have different effects 
on the average packet delay of the system. Therefore, not only the CSI of all S-R links but also the 
QSI of all RSs should be considered in directing the source node's transmission. Such coupling on the 
system QSI is a unique challenge in delay-optimal control of multi-hop systems. Fig. 1 shows the top
level architecture illustrating the interactions among all the queues in the two-hop cooperative system.
[Figure 2 content: example bidding timeline for a 2-RS system.
Channel estimation: each RS estimates the Src-RS and RS-Dest channels.
First-stage auction: RS1 broadcasts first-stage bids {A1(0), A1(1), A1(2)}; RS2 broadcasts first-stage bids {A2(0), A2(1), A2(2)}.
Second-stage auction: RS1 broadcasts second-stage bid B1, its preferred Rx-RS (say RS2) and its preferred numbers of
SR and RD data streams [2,0] (2 data streams for the SR link and 0 data streams for the RD link); RS2 broadcasts second-stage
bid B2, its preferred Rx-RS (say RS1) and its preferred numbers of SR and RD data streams [1,1] (one data stream for the SR
link and the RD link respectively).
Outcome: suppose B2 < B1; then the transmission parameters are: Rx-RS: RS1 (Src delivers packet #4 to RS1); Tx-RS: RS2
(RS2 delivers buffered packet #3 to Dest); number of Src-RS1 data streams: 1; number of RS2-Dest data streams: 1.]



Fig. 2. Illustration of an example of bidding protocol for a 2-RS system. 



D. Distributive Contention Protocol 

Based on the BDF in Section II-B, we still need to determine (a) which RS should be the Rx-RS ($m^*$),
(b) which RS should be the Tx-RS ($n^*$) and (c) the number of data streams transmitted by the source node
($N_{SR}^*$) and the Tx-RS ($N_{RD}^*$). Due to the decentralized control requirement, we shall propose a two-stage
two-winner auction mechanism for distributive⁹ contention resolution.

Figure 2 illustrates an example of the bidding protocol for a 2-RS system. As a result, the RS selection and
data stream allocation procedure can be parameterized by a bidding vector
$\{(A_m(0), \dots, A_m(\min(N_T, N_R)), \mathcal{B}_m) | \forall m\}$. We shall refer to the bidding vector as the RS selection and
data stream allocation policy in the rest of the paper.



E. Optimization Objective and Control Policy 

Definition 1 (Distributive Stationary Control Policy): A distributive stationary control policy $\Pi = \{\Pi_m | 1 \le m \le M\}$
is a collection of stationary control policies $\Pi_m$ at the m-th RS, where $\Pi_m = \{\Pi_p^m, \Pi_A^m, \Pi_B^m\}$
includes the power allocation policy of the S-R link and the R-D link $\Pi_p^m$, and the first-stage and second-stage bidding
policies $\Pi_A^m$ and $\Pi_B^m$. Specifically,

$$\Pi_p^m(S_m) = \{p_{S,m}(N_{SR}), p_{m,D}(N_{RD}) : N_{SR}, N_{RD} = 0, 1, \dots, \min(N_T, N_R)\} \triangleq \mathcal{P}_m \qquad (7)$$

$$\Pi_A^m(S_m) = \{A_m(N_{SR}) | N_{SR} = 0, 1, \dots, \min(N_T, N_R)\} \triangleq \mathcal{A}_m \qquad (8)$$

$$\Pi_B^m(S_m, \cup_{m'=1}^{M} \mathcal{A}_{m'}) = \{B_m, J_m, (N_{SR,m}, N_{RD,m}) | N_{RD,m} = \min(N_T, N_R - N_{SR,m})\} \triangleq \mathcal{B}_m \qquad (9)$$

for $m = 1, 2, \dots, M$, where $p_{S,m}(N_{SR})$ is the total transmit power allocation at the source node for the
S-R link with $N_{SR}$ data streams and $p_{m,D}(N_{RD})$ is the total transmit power allocation at the Tx-RS for the
R-D link with $N_{RD}$ data streams.

Denote the local system state of the m-th RS as $S_m = (Q_S, Q_m, \mathbf{H}_m)$ ($1 \le m \le M$). Therefore, the
global system state is given by $S = \cup_{m=1}^{M} S_m = (\mathbf{Q}, \mathbf{H})$.

9 Similar to the common notion of distributive algorithms in the literature [15], [16], the term "distributive" in this paper refers
to algorithms that perform computation locally but require explicit message passing. Yet, the message passing overhead in the
bidding process is quite mild [17], [18].

October 5, 2010

Remark 2 (Distributive Consideration of Stationary Control Policy $\Pi$ in Definition 1): The stationary control
policy $\Pi = \{\Pi_m | 1 \le m \le M\}$ is distributive in the sense that the policy $\Pi_m$ at each RS m only
depends on the local system state $S_m$ and the broadcast bidding information available at RS m. Thus, for
notation simplicity, we shall omit the bidding information when the meaning is clear, i.e. we shall use
$\Pi_m(S_m) = \{\Pi_p^m(S_m), \Pi_A^m(S_m), \Pi_B^m(S_m)\}$ in the rest of the paper.

A stationary control policy $\Pi$ induces a joint distribution for the random process $\{S(t)\}$. Under
Assumptions 1 and 2, $S(t+1)$ only depends on $S(t)$ and the actions at frame t, and hence the induced random
process $\{S(t)\}$ for a given control policy $\Pi$ is Markovian with the following transition probability:

$$\Pr[S(t+1) | S(t), \Pi(S(t))] = \Pr[\mathbf{H}(t+1)] \Pr[\mathbf{Q}(t+1) | S(t), \Pi(S(t))], \qquad (10)$$

where the equality is because of Assumption 1 and the queue dynamics transition probability
$\Pr[\mathbf{Q}(t+1) | S(t), \Pi(S(t))]$ is given by

$$\Pr[\mathbf{Q}(t+1) | S(t), \Pi(S(t))] = \begin{cases} \Pr\left[X(t) = Q_S(t+1) - [Q_S(t) - R_{S,m^*}(t)]^+\right], & \text{if } Q_m(t+1) = Q_m(t)\ (\forall m \neq m^*, n^*) \text{ and } Q_{m^*}(t+1) = \min\{Q_{m^*}(t) + R_{S,m^*}(t), N_Q\},\ Q_{n^*}(t+1) = \max\{Q_{n^*}(t) - R_{n^*,D}(t), 0\} \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

Given a unichain policy $\Pi$, the induced Markov chain $\{S(t)\}$ is ergodic and there exists a unique steady
state distribution $\pi_S$ [19]. Therefore, we have the average end-to-end delay of the two-hop cooperative RS
system summarized in the following lemma:

Lemma 1 (Average End-to-End Delay): For small average packet drop rate constraint D, the average
end-to-end delay of the two-hop cooperative RS system is given by

$$\bar{T}(\Pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^{\Pi}\left[ \frac{\sum_{m=S}^{M} Q_m(t)}{\lambda_S} \right] = E_{\pi_S}\left[ \frac{\sum_{m=S}^{M} Q_m}{\lambda_S} \right], \qquad (12)$$

where $m = S, 1, 2, \dots, M$ in the equation¹⁰, $E_{\pi_S}$ means taking the expectation with respect to the induced
steady state distribution $\pi_S$ (induced by the unichain control policy $\Pi$) and $\lambda_S$ is the average number of
arrival bits per frame at the source node.

Proof: Please refer to Appendix A. ■ 
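The lemma is a Little's-law statement: the time-averaged total backlog divided by the arrival rate. As a minimal sketch, the empirical counterpart over a finite simulated trace can be computed as below; the function name `average_delay` and the trace format are hypothetical, and a finite trace only approximates the limit in the lemma.

```python
def average_delay(queue_trace, lam_s):
    """Empirical average end-to-end delay (in frames) via Little's law.

    queue_trace -- list of per-frame tuples (Q_S, Q_1, ..., Q_M) in bits
    lam_s       -- average number of arrival bits per frame at the source
    """
    if lam_s <= 0 or not queue_trace:
        raise ValueError("need a non-empty trace and a positive arrival rate")
    total_backlog = sum(sum(frame) for frame in queue_trace)
    return total_backlog / (len(queue_trace) * lam_s)


if __name__ == "__main__":
    trace = [(4, 2, 0), (6, 1, 1), (2, 3, 3)]   # three frames, source plus 2 RSs
    print(average_delay(trace, lam_s=2.0))
```

Because the backlog is measured in bits and the arrival rate in bits per frame, the result is expressed in frames.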
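Similarly, the source node's average drop rate constraint¹¹, the source node's average power constraint and
each RS m's average power constraint are given by

$$\bar{D}(\Pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^{\Pi}\left[ I[Q_S(t) = N_Q] \right] = E_{\pi_S}\left[ I[Q_S = N_Q] \right] \le D \qquad (13)$$

$$\bar{P}_S(\Pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^{\Pi}\left[ \sum_{m=1}^{M} \sum_{i=1}^{N_{\min}} I_{S,m}^{i}(t)\, p_{S,m}(i) \right] = E_{\pi_S}\left[ \sum_{m=1}^{M} \sum_{i=1}^{N_{\min}} I_{S,m}^{i}\, p_{S,m}(i) \right] \le P_S \qquad (14)$$

$$\bar{P}_m(\Pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^{\Pi}\left[ \sum_{i=1}^{N_{\min}} I_{m,D}^{i}(t)\, p_{m,D}(i) \right] = E_{\pi_S}\left[ \sum_{i=1}^{N_{\min}} I_{m,D}^{i}\, p_{m,D}(i) \right] \le P_R, \quad 1 \le m \le M \qquad (15)$$

where $N_{\min} = \min(N_T, N_R)$, $I_{S,m}^{i} = I[m = m^*]\, I[i = N_{SR}^*]$ and $I_{m,D}^{i} = I[m = n^*]\, I[i = N_{RD}^*]$.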



III. Constrained Markov Decision Problem Formulation 

In this section, we shall formulate the delay-optimal problem as an infinite horizon average cost 
constrained Markov Decision Problem (CMDP) and discuss the general solution. 



A. CMDP Formulation 

The goal of the controller is to choose an optimal stationary feasible unichain policy $\Pi^*$ that minimizes
the average end-to-end transmission delay in (12). Specifically, the delay-optimal control problem is
summarized below.

Problem 1 (Delay-Optimal Control Problem for MIMO Relay System): Find a feasible stationary unichain
policy $\Pi = (\Pi_1, \dots, \Pi_M)$ such that the average end-to-end delay is minimized subject to the average drop
rate constraint at the source node and the average power constraints at the source node and each RS node¹²,
i.e.

$$\min_{\Pi} \bar{T}(\Pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^{\Pi}\left[ \sum_{m=S}^{M} Q_m(t) \right] \quad \text{s.t. (13), (14), (15)}.$$

Problem 1 is an infinite horizon average cost constrained Markov Decision Problem (CMDP) [19] with
system state space $\mathcal{S} = \{S^1, S^2, \dots\} = \mathcal{Q} \times \mathcal{H}$ (where $\mathcal{Q}$ is the global QSI state space and $\mathcal{H}$ is the
global CSI state space), action space $\mathcal{P} \times \mathcal{A} \times \mathcal{B}$ (where $\mathcal{P} = \{\forall \mathcal{P}_m | \forall m\}$ is the power allocation action space,
$\mathcal{A} = \{\forall \mathcal{A}_m | \forall m\}$ is the first-stage bidding action space and $\mathcal{B} = \{\forall \mathcal{B}_m | \forall m\}$ is the second-stage bidding
action space), transition kernel given by (10), and the per-stage cost function $g(S, \Pi(S)) = \sum_{m=S}^{M} Q_m$.

10 This abuse of notation will also appear in the rest of this paper as long as the meaning is clear.

11 Since the source node and the M RSs have buffers with the same buffer size $N_Q$, the average drop rate at each RS node is much
lower than the average drop rate at the source node. Therefore, we omit the average drop rate constraint at each RS to simplify
the problem.

B. Lagrangian Approach for the CMDP 

The CMDP in Problem 1 can be converted into an unconstrained MDP by Lagrange theory [14]. For
any vector of Lagrange multipliers (LM) $\gamma = [\gamma_{S,d}, \gamma_{S,p}, \gamma_{1,p}, \dots, \gamma_{M,p}]^T$, we define the Lagrangian as

$$L(\Pi, \gamma) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^{\Pi}\left[ g(S(t), \Pi(S(t)), \gamma) \right],$$

where

$$g(S, \Pi(S), \gamma) = \sum_{m=S}^{M} Q_m + \gamma_{S,p}\left( \sum_{m=1}^{M} \sum_{i=1}^{N_{\min}} I_{S,m}^{i}\, p_{S,m}(i) - P_S \right) + \sum_{m=1}^{M} \gamma_{m,p}\left( \sum_{i=1}^{N_{\min}} I_{m,D}^{i}\, p_{m,D}(i) - P_R \right) + \gamma_{S,d}\left( I[Q_S = N_Q] - D \right).$$

Therefore, the corresponding unconstrained MDP for a particular vector of LMs $\gamma$ is given by

$$G(\gamma) = \min_{\Pi} L(\Pi, \gamma) = \min_{\Pi} \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^{\Pi}\left[ g(S(t), \Pi(S(t)), \gamma) \right], \qquad (16)$$

where $G(\gamma)$ gives the Lagrange dual function. The dual problem of the primal problem in Problem 1 is
given by $\max_{\gamma \succeq 0} G(\gamma)$. It is shown in [20] that there exists a Lagrange multiplier $\gamma^* \ge 0$ such that $\Pi^*$
minimizes $L(\Pi, \gamma^*)$ and the saddle point condition $L(\Pi, \gamma^*) \ge L(\Pi^*, \gamma^*) \ge L(\Pi^*, \gamma)$ holds.
Using standard Lagrange theory [14], $\Pi^*$ is the primal optimal (i.e. solving Problem 1),
$\gamma^*$ is the dual optimal (solving the dual problem) and the duality gap is zero. Thus, by solving the dual
problem, we can obtain the primal optimal $\Pi^*$. Therefore, we shall first solve the unconstrained MDP in
(16) in the following.
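To make the role of the Lagrange multipliers concrete, the sketch below evaluates a Lagrangian per-stage cost for one frame. It assumes the usual form in which each constraint contributes a penalty "multiplier times (usage minus budget)" on top of the total queue length; the function name, the dictionary keys and all numerical values are hypothetical.

```python
def lagrangian_stage_cost(queues, n_q, src_power, relay_powers, gamma, limits):
    """Per-stage cost: total backlog plus multiplier-weighted constraint terms.

    queues       -- (Q_S, Q_1, ..., Q_M), current queue lengths
    n_q          -- N_Q, buffer size (source overflow indicator is Q_S == N_Q)
    src_power    -- source transmit power used in this frame
    relay_powers -- transmit power used by each RS in this frame
    gamma        -- multipliers: {'S_d': ..., 'S_p': ..., 'R_p': [per-RS values]}
    limits       -- budgets: {'D': ..., 'P_S': ..., 'P_R': ...}
    """
    backlog = sum(queues)
    drop_indicator = 1.0 if queues[0] == n_q else 0.0
    cost = backlog
    cost += gamma['S_p'] * (src_power - limits['P_S'])
    cost += sum(g * (p - limits['P_R']) for g, p in zip(gamma['R_p'], relay_powers))
    cost += gamma['S_d'] * (drop_indicator - limits['D'])
    return cost


if __name__ == "__main__":
    gamma = {'S_d': 10.0, 'S_p': 0.5, 'R_p': [0.2, 0.4]}
    limits = {'D': 0.01, 'P_S': 1.0, 'P_R': 1.0}
    print(lagrangian_stage_cost((5, 2, 1), 5, 2.0, [0.0, 1.5], gamma, limits))
```

A larger multiplier makes violating the corresponding budget more expensive, so the unconstrained minimizer is pushed back toward the feasible region; the dual problem tunes the multipliers until the constraints are met.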

For a given LM vector $\gamma$, the optimizing unichain policy for the unconstrained MDP (16) can be obtained
by solving the associated Bellman equation w.r.t. $(\theta, \{J(S)\})$ as follows

$$\theta + J(S^i) = \min_{\Pi(S^i)} \left\{ g(S^i, \Pi(S^i), \gamma) + \sum_{S^j} \Pr[S^j | S^i, \Pi(S^i)]\, J(S^j) \right\} \quad \forall S^i \in \mathcal{S}, \qquad (17)$$

where $\{J(S)\}$ is the value function of the MDP and $\Pr[S^j | S^i, \Pi(S^i)]$ is the transition kernel which can
be obtained from (10). $\theta = \min_{\Pi} L(\Pi, \gamma)$ is the optimal average cost per stage and the optimizing policy
is $\Pi^*$ with $\Pi^*(S^i)$ minimizing the R.H.S. of (17) at any state $S^i$. For any unichain policy with irreducible
Markov chain $\{S(t)\}$, the solution to (17) is unique [19]. We restrict our policy space to unichain
policies¹³ and we denote $\Pi^*$ as the optimal unichain policy.

12 To simplify the notation, we shall normalize $\lambda_S = 1$ in the rest of the paper.
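For intuition about what an average-cost Bellman equation of this form asks for, the following sketch solves it by relative value iteration on a deliberately tiny single-queue toy problem. This is a generic textbook method applied to a hypothetical toy model (one queue, two power levels, Bernoulli arrivals and service), not the algorithm proposed in this paper; all names and parameter values are assumptions chosen so that every policy is unichain and aperiodic.

```python
def toy_queue_mdp(n_q=3, arrival_prob=0.3, success=(0.5, 0.95),
                  power=(0.0, 1.0), gamma=0.5):
    """Build cost[s][a] and trans[s][a] (dict: next state -> probability)."""
    cost, trans = [], []
    for q in range(n_q + 1):
        # Per-stage cost: queue length plus multiplier-weighted power.
        cost.append([q + gamma * pw for pw in power])
        row = []
        for ps in success:                       # one entry per action
            dist = {}
            for served, p1 in ((1, ps), (0, 1.0 - ps)):
                for arr, p2 in ((1, arrival_prob), (0, 1.0 - arrival_prob)):
                    nxt = min(max(q - served, 0) + arr, n_q)
                    dist[nxt] = dist.get(nxt, 0.0) + p1 * p2
            row.append(dist)
        trans.append(row)
    return cost, trans


def relative_value_iteration(cost, trans, ref=0, iters=5000):
    """Return (theta, J) with theta + J[s] = min_a {cost + E[J(next)]}, J[ref] = 0."""
    n = len(cost)
    J = [0.0] * n
    theta = 0.0
    for _ in range(iters):
        backup = [min(cost[s][a] + sum(p * J[t] for t, p in trans[s][a].items())
                      for a in range(len(cost[s])))
                  for s in range(n)]
        theta = backup[ref]                      # valid because J[ref] == 0
        J = [b - theta for b in backup]
    return theta, J


if __name__ == "__main__":
    theta, J = relative_value_iteration(*toy_queue_mdp())
    print("average cost per stage:", theta)
    print("relative values:", J)
```

The toy has only four states, so the fixed point is found easily; the number of states of the two-hop system grows exponentially with the number of RSs, which is what makes this brute-force route impractical there.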

C. Equivalent Bellman Equation for the CMDP 

The Bellman equation in (17) is a fixed point problem over the functional space and this is very
complicated to solve due to the huge cardinality of the system state space. A brute-force solution does not
lead to any useful implementation. In this subsection, we shall illustrate that the Bellman equation in (17)
can be simplified into an equivalent form by exploiting the i.i.d. structure of the CSI process $\mathbf{H}(t)$. For
notation convenience, we partition the unichain policy $\Pi$ into a collection of actions based on the QSI.
Specifically, we have the following definition.

Definition 2 (Partitioned Actions for the m-th Relay): Given a unichain control policy $\Pi_m$, we define
$\Pi_m(\mathbf{Q}) = \Pi_m(Q_S, Q_m) = \{\Pi_m(Q_S, Q_m, \mathbf{H}_m) | \forall \mathbf{H}_m\}$ as the collection of actions under a given local
QSI $(Q_S, Q_m)$ for all possible local CSI $\mathbf{H}_m$. The complete policy $\Pi_m$ for the m-th RS is therefore equal
to the union of all the partitioned actions, i.e. $\Pi_m = \cup_{(Q_S, Q_m)} \Pi_m(Q_S, Q_m)$.

Therefore, we have $\Pi = \cup_{\mathbf{Q}} \Pi(\mathbf{Q})$ and we show that the optimal policy $\Pi^*$ of (16) can be obtained by
solving an equivalent Bellman equation summarized in the following lemma.

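Lemma 2 (Equivalent Bellman Equation): The control policy obtained by solving the Bellman equation
in (17) is the same as that obtained by solving the equivalent Bellman equation defined below:

$$\theta + V(\mathbf{Q}^i) = \min_{\Pi(\mathbf{Q}^i)} \left\{ \bar{g}(\mathbf{Q}^i, \Pi(\mathbf{Q}^i), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j | \mathbf{Q}^i, \Pi(\mathbf{Q}^i)]\, V(\mathbf{Q}^j) \right\}, \quad \forall \mathbf{Q}^i \in \mathcal{Q} \qquad (18)$$

where $\theta = \min_{\Pi} L(\Pi, \gamma)$ is the original optimal average cost per stage, $V(\mathbf{Q}^i) = E_{\mathbf{H}}[J(\mathbf{Q}^i, \mathbf{H}) | \mathbf{Q}^i]$ is
the conditional average value function for state $\mathbf{Q}^i$,

$$\bar{g}(\mathbf{Q}^i, \Pi(\mathbf{Q}^i), \gamma) = E_{\mathbf{H}}\left[ g((\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H}), \gamma) | \mathbf{Q}^i \right] \qquad (19)$$

is the conditional per-stage cost and $\Pr[\mathbf{Q}^j | \mathbf{Q}^i, \Pi(\mathbf{Q}^i)] = E_{\mathbf{H}}\left[ \Pr[\mathbf{Q}^j | (\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H})] \right]$ is the
conditional average transition kernel.

13 For most of the policies we are interested in, the associated Markov chain is irreducible and hence, there is virtually no loss by
restricting ourselves to unichain policies.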

Proof: Please refer to Appendix B. ■ 

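Remark 3: Note that solving the R.H.S. of (18) for each $\mathbf{Q}^i$ will give an overall control policy which is
a function of both the CSI $\mathbf{H}$ and the QSI $\mathbf{Q}^i$. This is illustrated by the following example.

Example 1: Consider a simple example with global CSI state space $\mathcal{H} = \{\mathbf{H}^1, \mathbf{H}^2\}$ and global QSI
state space $\mathcal{Q} = \{\mathbf{Q}^1, \mathbf{Q}^2\}$. Hence, the control variables are collectively denoted by the policy
$\Pi = \{\Pi(\mathbf{H}^1, \mathbf{Q}^1), \Pi(\mathbf{H}^2, \mathbf{Q}^1), \Pi(\mathbf{H}^1, \mathbf{Q}^2), \Pi(\mathbf{H}^2, \mathbf{Q}^2)\}$. Using Definition 2, the partitioned actions are simply
regroupings of variables given by $\Pi(\mathbf{Q}^1) = \{\Pi(\mathbf{Q}^1, \mathbf{H}^1), \Pi(\mathbf{Q}^1, \mathbf{H}^2)\}$ and $\Pi(\mathbf{Q}^2) = \{\Pi(\mathbf{Q}^2, \mathbf{H}^1), \Pi(\mathbf{Q}^2, \mathbf{H}^2)\}$.
For any QSI state $\mathbf{Q}^i$ ($i = 1, 2$), using Lemma 2, the optimal partitioned actions $\Pi^*(\mathbf{Q}^i)$ can be obtained
by solving the R.H.S. of (18) as follows

$$\Pi^*(\mathbf{Q}^i) = \arg\min_{\Pi(\mathbf{Q}^i)} \left\{ \sum_{k=1}^{2} \Pr[\mathbf{H}^k] \left[ g((\mathbf{Q}^i, \mathbf{H}^k), \Pi(\mathbf{Q}^i, \mathbf{H}^k), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j | (\mathbf{Q}^i, \mathbf{H}^k), \Pi(\mathbf{Q}^i, \mathbf{H}^k)]\, V(\mathbf{Q}^j) \right] \right\} \qquad (20)$$

Observe that the R.H.S. of (20) is a decoupled objective function w.r.t. the variables $\{\Pi(\mathbf{Q}^i, \mathbf{H}^1), \Pi(\mathbf{Q}^i, \mathbf{H}^2)\}$.
Hence, applying standard decomposition theory, $\forall k = 1, 2$, we have

$$\Pi^*(\mathbf{Q}^i, \mathbf{H}^k) = \arg\min_{\Pi(\mathbf{Q}^i, \mathbf{H}^k)} \left\{ g((\mathbf{Q}^i, \mathbf{H}^k), \Pi(\mathbf{Q}^i, \mathbf{H}^k), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j | (\mathbf{Q}^i, \mathbf{H}^k), \Pi(\mathbf{Q}^i, \mathbf{H}^k)]\, V(\mathbf{Q}^j) \right\}.$$

Using the results in Lemma 2, the optimal control of the original problem when the QSI and CSI realizations
are $(\mathbf{Q}^1, \mathbf{H}^2)$ is $\Pi^*(\mathbf{Q}^1, \mathbf{H}^2)$. Hence, the solution obtained by solving (18) is adaptive to both the CSI
and QSI.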
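The decomposition step in the example can be checked numerically: when the objective is a probability-weighted sum of terms that each depend on a single CSI state's action, minimizing jointly over all actions gives the same answer as minimizing each term separately. The sketch below does this by brute force; `cost[k][a]` stands for the bracketed term (per-stage cost plus expected value) for CSI state k and action a, and all numbers are hypothetical.

```python
from itertools import product


def joint_min(probs, cost):
    """Brute-force minimizer of sum_k probs[k] * cost[k][a_k] over (a_1, ..., a_K)."""
    actions = range(len(cost[0]))
    best = min(
        product(actions, repeat=len(probs)),
        key=lambda acts: sum(p * cost[k][a]
                             for k, (p, a) in enumerate(zip(probs, acts))),
    )
    return list(best)


def per_state_min(cost):
    """Minimize each CSI state's term independently."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]


if __name__ == "__main__":
    probs = [0.3, 0.7]                     # Pr[H^1], Pr[H^2]
    cost = [[4.0, 1.5, 2.0],               # terms for H^1, one per action
            [0.5, 3.0, 2.5]]               # terms for H^2, one per action
    print(joint_min(probs, cost), per_state_min(cost))
```

Both searches select the same pair of actions, but the joint search enumerates every combination of actions while the per-state search only scans each row once, which is what makes the per-realization form usable online.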

IV. Distributive Online Algorithm Based on Approximated MDP 

There are still two major obstacles ahead. Firstly, obtaining the value functions $\{V(\mathbf{Q})\}$ w.r.t. (18)
involves solving a system with an exponential number of equations and unknowns, and a brute force solution has
exponential complexity. Secondly, even if we could obtain the solution $\{V(\mathbf{Q})\}$, the derived control actions
would depend on the global QSI and CSI, which is highly undesirable. In this section, we shall overcome the
above challenges using approximate MDP and distributive stochastic learning. The linear approximation
architecture of the value function is given below [21]:

$$V(\mathbf{Q}) = \sum_{m=S}^{M} \sum_{q=0}^{N_Q} V_m(q)\, I[Q_m = q] \quad \text{or in the vector form} \quad \mathbf{V} = \mathbf{M}\mathbf{W}, \qquad (21)$$
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where we shall refer {V m (q)} as per-node value functions^ (Vm = S, 1, • • • , M) and {V(Q)} as global 
value function in the rest of this paper, V = [^(Q 1 ), V(Ql 2 l)] T is the vector form of global value 
functions, the parameter vector W and mapping matrix M is given below: 

T 



W 



M 



V S (0), Vs(N Q ), 7i(0), Ki(JVq), V S (N Q ), V M (0), Vw(JVq) 
I[Q^ = 0] ... I[Q^ = JVq] ... I[Qju = 0] ... I[Q^ = iV Q ] 



i[Qi s| 



i[Ql s| 



0] ... I[Qg'=JV ] 



S5 -0] ... W=JVq] 
where we let $V_S(0) = V_1(0) = \dots = V_M(0) = 0$ and set $\mathbf{Q}^1 = (0, \dots, 0)$ (i.e., all buffers empty) as the reference state without loss of generality. Compared with the original value function in (18), the dimension of the per-node value functions is much smaller. Therefore, the per-node value functions can only satisfy the Bellman equation (18) in some pre-determined system queue states. In this paper, we shall refer to this pre-determined subset of system queue states as the representative states [21]. Without loss of generality, we define the representative states $\mathcal{S}_R = \{\beta_{m,q} \,|\, \forall m = S, 1, 2, \dots, M;\ q = 1, 2, \dots, N_Q\}$, where $\beta_{m,q}$ denotes the QSI with $Q_m = q$ and $Q_n = 0$ $\forall n \neq m$. Moreover, we also define the inverse mapping matrix $\mathbf{M}^{-1}$ as



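\[
\mathbf{M}^{-1} = \begin{bmatrix}
0 & I[\mathbf{Q}^1 = \beta_{S,1}] & \cdots & I[\mathbf{Q}^1 = \beta_{S,N_Q}] & \cdots & 0 & I[\mathbf{Q}^1 = \beta_{M,1}] & \cdots & I[\mathbf{Q}^1 = \beta_{M,N_Q}] \\
\vdots & \vdots & & \vdots & & \vdots & \vdots & & \vdots \\
0 & I[\mathbf{Q}^{|S|} = \beta_{S,1}] & \cdots & I[\mathbf{Q}^{|S|} = \beta_{S,N_Q}] & \cdots & 0 & I[\mathbf{Q}^{|S|} = \beta_{M,1}] & \cdots & I[\mathbf{Q}^{|S|} = \beta_{M,N_Q}]
\end{bmatrix}^T
\]

Thus, we have $\mathbf{W} = \mathbf{M}^{-1}\mathbf{V}$. Instead of computing offline the best-fit parameter vector $\mathbf{W}$ (the per-node value function vector) w.r.t. the global value function $\mathbf{V}$, which is quite complex, we shall propose an online learning algorithm to estimate the parameter vector $\mathbf{W}$ in Section IV-B.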
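The linear architecture in (21) and the role of the representative states can be illustrated with a small numerical sketch. The helper `build_maps`, the node ordering (index 0 playing the role of the source) and the toy sizes are assumptions made for illustration only. The sketch builds $\mathbf{M}$ from the indicators $I[Q_m = q]$, builds $\mathbf{M}^{-1}$ by picking out the representative states $\beta_{m,q}$, and checks that $\mathbf{M}^{-1}\mathbf{M}\mathbf{W} = \mathbf{W}$ whenever $V_m(0) = 0$.

```python
import itertools
import numpy as np

def build_maps(num_nodes, n_q):
    """Return (states, M, M_inv) for num_nodes queues with lengths 0..n_q."""
    states = list(itertools.product(range(n_q + 1), repeat=num_nodes))
    index = {s: i for i, s in enumerate(states)}
    dim_w = num_nodes * (n_q + 1)

    # Mapping matrix: row i holds the indicators I[Q_m^i = q].
    M = np.zeros((len(states), dim_w))
    for i, s in enumerate(states):
        for m, q in enumerate(s):
            M[i, m * (n_q + 1) + q] = 1.0

    # Inverse mapping: row (m, q) selects the representative state beta_{m,q}
    # (Q_m = q, all other queues empty). Rows for q = 0 stay zero because
    # V_m(0) = 0 by convention.
    M_inv = np.zeros((dim_w, len(states)))
    for m in range(num_nodes):
        for q in range(1, n_q + 1):
            beta = [0] * num_nodes
            beta[m] = q
            M_inv[m * (n_q + 1) + q, index[tuple(beta)]] = 1.0
    return states, M, M_inv

# Hypothetical example: 3 nodes (source + 2 relays), buffer size 2.
states, M, M_inv = build_maps(num_nodes=3, n_q=2)
W = np.array([0.0, 1.0, 2.5, 0.0, 0.7, 1.9, 0.0, 0.4, 1.1])  # V_m(0) = 0
V = M @ W              # global value function on all 27 states
W_back = M_inv @ V     # recovered per-node value functions
```

In this toy case the global value function has 27 entries while the parameter vector has only 9, which is the dimension reduction the approximation relies on.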



A. Distributive Control Policy under Linear Value Function Approximation 

Using the approximate value function in (21), we shall derive a distributive control policy which depends on the local CSI and local QSI as well as the per-node value functions $\{V_m(q)\}$ at each node $m$ ($\forall m = S, 1, \dots, M$). Specifically, using the approximation in (21), the control policy in (18) can be obtained by solving the following simplified optimization problem.

14 In this paper, we assume each RS (say the $m$-th RS) has knowledge of the source node's queue length $Q_S$ and its own queue length $Q_m$. Therefore, the per-node value functions $V_S$ and $V_m$ are known at the $m$-th RS.
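Problem 2 (Optimal Control Action with Approximated Value Function): For any given value function $V(\mathbf{Q}^i) = \sum_{m=S}^{M} \sum_{q=0}^{N_Q} V_m(q) I[Q^i_m = q]$, the optimal control policy is given by

\begin{align}
\Pi^*(\mathbf{Q}^i) &= \arg\min_{\Pi(\mathbf{Q}^i)} \Big\{ g(\mathbf{Q}^i, \Pi(\mathbf{Q}^i), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j \,|\, \mathbf{Q}^i, \Pi(\mathbf{Q}^i)] V(\mathbf{Q}^j) \Big\} \nonumber \\
&= \arg\min_{\Pi(\mathbf{Q}^i)} \Big\{ \sum_{m=S}^{M} Q^i_m + \gamma_{S,d} I[Q^i_S = N_Q] + \sum_n f_x(n) V(\mathbf{Q}^i_n) \nonumber \\
&\qquad + \mathbb{E}_{\mathbf{H}} \Big[ \sum_{m, N_{SR}} I^{N_{SR}}_{S,m} G_{S,m}(N_{SR}, p_{S,m}) + \sum_{n, N_{RD}} I^{N_{RD}}_{n,D} G_{n,D}(N_{RD}, p_{n,D}) \Big] \Big\} \nonumber \\
&\Leftrightarrow \arg\min_{\Pi(\mathbf{Q}^i)} \mathbb{E}_{\mathbf{H}} \Big[ \sum_{m, N_{SR}} I^{N_{SR}}_{S,m} G_{S,m}(N_{SR}, p_{S,m}) + \sum_{n, N_{RD}} I^{N_{RD}}_{n,D} G_{n,D}(N_{RD}, p_{n,D}) \Big] \tag{22}
\end{align}

where $\mathbf{Q}^i_n = [Q^i_S + n, Q^i_1, Q^i_2, \dots, Q^i_M]$,

\[
G_{S,m}(N_{SR}, p_{S,m}) = \gamma_{S,p}\, p_{S,m} + \sum_n f_x(n) \Big( V_S\big(Q^i_S - R_{S,m}(N_{SR}, p_{S,m}) + n\big) - V_S(Q^i_S + n) \Big) + V_m\big(Q^i_m + R_{S,m}(N_{SR}, p_{S,m})\big) - V_m(Q^i_m)
\]

and

\[
G_{n,D}(N_{RD}, p_{n,D}) = \gamma_{n,p}\, p_{n,D} + V_n\big(Q^i_n - R_{n,D}(N_{RD}, p_{n,D})\big) - V_n(Q^i_n).
\]

The solution of Problem 2 is summarized in Lemma 3 below.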

Lemma 3 (Distributive Control Policy): Given the per-node value functions $\{V_m(q)\}$ ($\forall m = S, 1, \dots, M$) and any realization of CSI $\mathbf{H}$ and QSI $\mathbf{Q}$,$^{15}$ the following distributive control solves Problem 2:

• Power control for the S-R link and R-D link ($\forall m = 1, \dots, M$):

\[
p^*_{S,m}(N_{SR}) = \arg\min_{p_{S,m}} G_{S,m}(N_{SR}, p_{S,m}) \quad \text{and} \quad p^*_{m,D}(N_{RD}) = \arg\min_{p_{m,D}} G_{m,D}(N_{RD}, p_{m,D}) \tag{23}
\]

where $N_{SR}, N_{RD} = 0, 1, \dots, \min(N_T, N_R)$.

• First-stage bid at RSs ($\forall m = 1, \dots, M$):

\[
A^*_m(N_{SR}) = G_{S,m}\big(N_{SR}, p^*_{S,m}(N_{SR})\big) \tag{24}
\]

where $N_{SR} = 0, 1, \dots, \min(N_T, N_R)$.

• Second-stage bid at RSs ($\forall n = 1, \dots, M$):

\begin{align}
(I_n, N_{SR,n}) &= \arg\min_{(m, N_{SR})} \Big\{ A^*_m(N_{SR}) + G_{n,D}\big(N_{RD}, p^*_{n,D}(N_{RD})\big) \Big\} \nonumber \\
B^*_n &= G_{S,I_n}\big(N_{SR,n}, p^*_{S,I_n}(N_{SR,n})\big) + G_{n,D}\big(N_{RD,n}, p^*_{n,D}(N_{RD,n})\big) \tag{25}
\end{align}

where $N_{RD} = \min(N_T, N_R - N_{SR})$.

15 Note that the following expressions are all functions of the system state. We omit the system state for notational simplicity when the meaning is clear.
In addition, for a sufficiently large source arrival rate $\lambda_S$ and average transmit power constraints $\{P_S, P_R\}$, the power control policy in (23) has the following closed-form expression:



N SR Vs(Q i s )-V m (Q i m)] ! 

Pm,D\^ RD) ~ ^ V -3 , (27) 

where $V'_S(Q^i_S) = \frac{V_S(Q^i_S + 1) - V_S(Q^i_S - 1)}{2}$ and $V'_m(Q^i_m) = \frac{V_m(Q^i_m + 1) - V_m(Q^i_m - 1)}{2}$.

Proof: Please refer to Appendix C. ■ 
Remark 4 (Multi-level Water-Filling Structure of the Control Policy): The power control policy in (26) and (27), as well as the RS selection and data stream allocation control policy in (24) and (25), are functions of both the CSI and the QSI, where they depend on the QSI indirectly via the per-node value functions $\{V_m(q)\}$ ($\forall m = S, 1, \dots, M$). The power control solution has the form of multi-level water-filling, where the power is allocated according to the CSI while the water level is adaptive to the QSI.
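The two-stage bidding of Lemma 3 can be sketched as a small search over precomputed bid tables. This is a minimal illustration under stated assumptions, not the paper's implementation: `first_bids[m, N_SR]` is assumed to already hold $A^*_m(N_{SR})$ from (24), `g_rd[n, N_RD]` is assumed to hold $G_{n,D}(N_{RD}, p^*_{n,D}(N_{RD}))$, and the function name, the `exclude_self` switch (a Tx-RS does not select itself as Rx-RS) and the numbers are hypothetical.

```python
import numpy as np

def two_stage_auction(first_bids, g_rd, n_t, n_r, exclude_self=True):
    """Return (m_star, n_star, n_sr_star, n_rd_star, winning_bid).

    first_bids[m, N_SR] : first-stage bid A*_m(N_SR) of RS m.
    g_rd[n, N_RD]       : R-D term of RS n at its optimal power.
    """
    num_rs, num_sr = first_bids.shape
    best = None
    for n in range(num_rs):                      # candidate Tx-RS
        # Second-stage bid of RS n: best (m, N_SR) given the first-stage bids.
        b_n, i_n, n_sr_n = np.inf, None, None
        for m in range(num_rs):                  # candidate Rx-RS
            if exclude_self and m == n:
                continue
            for n_sr in range(num_sr):
                n_rd = min(n_t, n_r - n_sr)      # N_RD = min(N_T, N_R - N_SR)
                val = first_bids[m, n_sr] + g_rd[n, n_rd]
                if val < b_n:
                    b_n, i_n, n_sr_n = val, m, n_sr
        if i_n is None:                          # no admissible Rx-RS for n
            continue
        if best is None or b_n < best[4]:        # winner has the smallest B*_n
            best = (i_n, n, n_sr_n, min(n_t, n_r - n_sr_n), float(b_n))
    return best

# Hypothetical bid tables for 3 RSs, N_T = 2, N_R = 4 (columns: 0, 1, 2 streams).
first_bids = np.array([[0.0, -1.0, -1.5],
                       [0.0, -2.0, -1.0],
                       [0.0, -0.5, -0.2]])
g_rd = np.array([[0.0, -0.3, -0.6],
                 [0.0, -1.0, -2.5],
                 [0.0, -0.1, -0.2]])
result = two_stage_auction(first_bids, g_rd, n_t=2, n_r=4)
```

Each RS only needs its own row of the tables plus the broadcast first-stage bids, which is why the policy can be run without global CSI or QSI.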

B. Online Distributive Stochastic Learning Algorithm to Estimate the Per-node Value Functions $\{V_m(q)\}$ and the LMs $\{\gamma_{S,d}, \gamma_{S,p}, \gamma_{m,p}\}$

In Lemma 3, the control actions are functions of the per-node value functions $\{V_m(q)\}$ and the LMs $\{\gamma_{S,d}, \gamma_{S,p}, \gamma_{m,p}\}$. In this section, we propose an online learning algorithm to determine the per-node value functions and the LMs in real time. The almost-sure convergence proof of this algorithm is provided in the next section. The system procedure of the proposed distributive online learning is given below.

• Step 1 [Initialization]: Each RS $m$ initializes its per-node value functions and LMs, denoted as $\{V^0_m(q)\}$ and $\gamma^0_{m,p}$, as well as the per-node value functions and LMs for the source node, denoted as $\{V^0_S(q)\}$ and $\{\gamma^0_{S,p}, \gamma^0_{S,d}\}$. The initialization of $V^0_S$ and $\{\gamma^0_{S,p}, \gamma^0_{S,d}\}$ at each RS should be the same.

• Step 2 [Determination of control actions]: At the beginning of the $t$-th frame, the source node broadcasts its QSI $Q_S(t)$ to the RS nodes. Based on the local system information $(Q_S(t), Q_m(t), \mathbf{H}_m(t))$ and the per-node value functions $\{V^t_m(q)\}$ and $\{V^t_S(q)\}$, each RS $m$ determines the distributive control actions, including the S-R and R-D power allocation $p^*_{S,m}(N_{SR}, t)$, $p^*_{m,D}(N_{RD}, t)$, the first-stage bid $A^*_m(N_{SR}, t)$ ($N_{SR} = 1, \dots, N_{\min}$), as well as the second-stage bid $B^*_m(t)$, $I_m(t)$, $N_{SR,m}(t)$ according to Lemma 3. Based on the contention resolution protocol described in Section III-D, the Rx-RS and Tx-RS pair is given by $(m^*(t), n^*(t))$ (where $n^*(t) = \arg\min_n B^*_n(t)$ and $m^*(t) = I_{n^*(t)}(t)$) and the corresponding pair of numbers of data streams is given by $(N^*_{SR}(t), N^*_{RD}(t))$ (where $N^*_{SR}(t) = N_{SR,n^*(t)}(t)$ and $N^*_{RD}(t) = N_{RD,n^*(t)}(t)$).

• Step 3 [Per-node value functions and LMs update]: Each RS $m$ updates the per-node value functions $\{V^{t+1}_S(q)\}$, $\{V^{t+1}_m(q)\}$ as well as the LMs $\{\gamma^{t+1}_{S,d}, \gamma^{t+1}_{S,p}, \gamma^{t+1}_{m,p}\}$ according to Algorithm 1. Finally, let $t = t + 1$ and go to Step 2.
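Algorithm 1 (Online distributive learning algorithm for per-node value functions and LMs):

\begin{align}
V^{t+1}_m(q) &= V^t_m(q) + \epsilon^t_v \Big[ \gamma_{S,d} I[Q_S(t) = N_Q] + q + B^*_{n^*(t)}(t) - V^t_m(q) \Big] I[\mathbf{Q}(t) = \beta_{m,q}], \quad m = S, 1, \dots, M \tag{28} \\
\gamma^{t+1}_{S,d} &= \Big( \gamma^t_{S,d} + \epsilon^t_d \big( I[Q_S(t) = N_Q] - D \big) \Big)^+ \tag{29} \\
\gamma^{t+1}_{S,p} &= \Big( \gamma^t_{S,p} + \epsilon^t_p \Big( \sum_{m=1}^{M} \sum_{N_{SR}=1}^{\min(N_T, N_R)} I^{N_{SR}}_{S,m}(t)\, p^*_{S,m}(N_{SR}, t) - P_S \Big) \Big)^+ \tag{30} \\
\gamma^{t+1}_{m,p} &= \Big( \gamma^t_{m,p} + \epsilon^t_p \Big( \sum_{N_{RD}=1}^{\min(N_T, N_R)} I^{N_{RD}}_{m,D}(t)\, p^*_{m,D}(N_{RD}, t) - P_R \Big) \Big)^+, \quad m = 1, 2, \dots, M \tag{31}
\end{align}

where $I^{N_{SR}}_{S,m}(t) = I[m = m^*(t)]\, I[N_{SR} = N^*_{SR}(t)]$, $I^{N_{RD}}_{m,D}(t) = I[m = n^*(t)]\, I[N_{RD} = N^*_{RD}(t)]$, and $\{\epsilon^t_v > 0\}$, $\{\epsilon^t_p > 0\}$, $\{\epsilon^t_d > 0\}$ are the step size sequences satisfying

\[
\sum_{t=0}^{\infty} \epsilon^t_v = \infty, \quad \sum_{t=0}^{\infty} \epsilon^t_p = \infty, \quad \sum_{t=0}^{\infty} \epsilon^t_d = \infty, \quad \sum_{t=0}^{\infty} \big[ (\epsilon^t_v)^2 + (\epsilon^t_p)^2 + (\epsilon^t_d)^2 \big] < \infty, \quad \lim_{t \to +\infty} \frac{\epsilon^t_p}{\epsilon^t_v} = 0, \quad \lim_{t \to +\infty} \frac{\epsilon^t_d}{\epsilon^t_v} = 0.
\]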
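The update rules of Algorithm 1 share two building blocks: a relaxation step for the per-node value function and a projected step $(\cdot)^+$ for each LM. The sketch below shows both with one possible family of step sizes. The polynomial decay exponents are an assumption chosen only because they satisfy the summability and vanishing-ratio conditions above, and the function names are hypothetical.

```python
def step_sizes(t):
    """Hypothetical step sizes for frame t >= 1 (an assumption, not from the paper).

    Each sequence sums to infinity, has a finite sum of squares, and the
    LM step sizes vanish relative to the value-function step size.
    """
    eps_v = t ** -0.6   # per-node value function (fast timescale)
    eps_p = t ** -0.8   # power LMs (slow timescale)
    eps_d = t ** -1.0   # packet-drop LM (slow timescale)
    return eps_v, eps_p, eps_d

def update_lm(gamma, eps, observed, target):
    """Projected LM step [gamma + eps * (observed - target)]^+, the shape of (29)-(31)."""
    return max(0.0, gamma + eps * (observed - target))

def update_value(v, eps_v, stage_cost):
    """Relaxation step v + eps_v * (stage_cost - v), the shape of (28).

    Apply it only to the entry whose representative state is currently visited.
    """
    return v + eps_v * (stage_cost - v)

# Example: one power-LM step when the observed power exceeds the budget.
eps_v, eps_p, eps_d = step_sizes(100)
gamma_next = update_lm(gamma=0.2, eps=eps_p, observed=1.5, target=1.0)
```

An LM grows when its constraint is violated (observed above target) and shrinks toward zero otherwise, while the slower decay of `eps_v` keeps the value function update on the faster timescale.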



C. Almost-Sure Convergence of Distributive Stochastic Learning 

In this section, we shall establish technical conditions for the almost-sure convergence of the online distributive learning algorithm. Since $\{\epsilon^t_v\}$, $\{\epsilon^t_p\}$, $\{\epsilon^t_d\}$ satisfy $\epsilon^t_p = o(\epsilon^t_v)$ and $\epsilon^t_d = o(\epsilon^t_v)$, the LM update and the per-node potential function update are done simultaneously but over two different timescales. During the per-node potential function update (timescale I), we have $\gamma^{t+1}_{m,p} - \gamma^t_{m,p} = O(\epsilon^t_p) = o(\epsilon^t_v)$ and $\gamma^{t+1}_{S,d} - \gamma^t_{S,d} = O(\epsilon^t_d) = o(\epsilon^t_v)$. Therefore, the LMs appear to be quasi-static [22] during the per-node value function update in (28). For notational convenience, define the sequences of matrices $\{\mathbf{A}^t\}$ and $\{\mathbf{B}^t\}$ as $\mathbf{A}^{t-1} = (1 - \epsilon^{t-1}_v)\mathbf{I} + \mathbf{M}^{-1}\mathbf{P}(\Pi^t)\mathbf{M}\epsilon^{t-1}_v$ and $\mathbf{B}^{t-1} = (1 - \epsilon^{t-1}_v)\mathbf{I} + \mathbf{M}^{-1}\mathbf{P}(\Pi^{t-1})\mathbf{M}\epsilon^{t-1}_v$, where $\Pi^t$ is a unichain system control policy at the $t$-th frame, $\mathbf{P}(\Pi^t)$ is the transition matrix of system states given the unichain system control policy $\Pi^t$, and $\mathbf{I}$ is the identity matrix. The convergence property of the per-node value function update is given below:
Lemma 4 (Convergence of Per-Node Value Function Learning over Timescale I): Assume that, for all feasible policies in the policy space, there exist some positive integer $\beta$ and $\tau_\beta > 0$ such that

\[
[\mathbf{A}^{\beta-1} \cdots \mathbf{A}^1]_{(a,I)} \ge \tau_\beta, \quad [\mathbf{B}^{\beta-1} \cdots \mathbf{B}^1]_{(a,I)} \ge \tau_\beta \quad \forall a \tag{32}
\]

where $[\cdot]_{(a,I)}$ denotes the element in the $a$-th row and $I$-th column (where $I$ corresponds to the reference state $\mathbf{Q}^1$) and $\tau_t = O(\epsilon^t_v)$ ($\forall t$). The following statements are true:

• The update of the parameter vector (or per-node potential vector) will converge almost surely for any given initial parameter vector $\mathbf{W}^0$ and LMs $\gamma$, i.e., $\lim_{t \to \infty} \mathbf{W}^t(\gamma) = \mathbf{W}^\infty(\gamma)$.

• The steady-state parameter vector $\mathbf{W}^\infty$ satisfies:

\[
\theta \mathbf{e} + \mathbf{W}^\infty(\gamma) = \mathbf{M}^{-1}\mathbf{T}(\gamma, \mathbf{M}\mathbf{W}^\infty(\gamma)) \tag{33}
\]

where $\theta$ is a constant, $\mathbf{W}^\infty$ is given by

\[
\mathbf{W}^\infty = \big[V^\infty_S(0), \dots, V^\infty_S(N_Q), V^\infty_1(0), \dots, V^\infty_1(N_Q), \dots, V^\infty_M(0), \dots, V^\infty_M(N_Q)\big]^T,
\]

and the mapping $\mathbf{T}$ is defined as $\mathbf{T}(\gamma, \mathbf{V}) = \min_{\Pi}[\mathbf{g}(\gamma, \Pi) + \mathbf{P}(\Pi)\mathbf{V}]$.

Proof: Please refer to Appendix D. ■
Remark 5 (Interpretation of the Sufficient Conditions in Lemma 4): Note that $\mathbf{A}^t$ and $\mathbf{B}^t$ are related to the transition probability of the reference states. Condition (32) simply means that there is one reference state accessible from all the other reference states after some finite number of transition steps. This is a very mild condition and will be satisfied in most practical cases.
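The accessibility reading of condition (32) can be tested numerically on small examples. The sketch below is a simplified illustration: it uses plain row-stochastic matrices in place of the projected matrices $\mathbf{M}^{-1}\mathbf{P}(\Pi)\mathbf{M}$, and the function name, threshold and matrices are hypothetical.

```python
import numpy as np

def has_common_reachable_state(mats, col, tau):
    """Check that every row of the product A^{beta-1} ... A^1 has an entry of at
    least tau in column `col`. `mats` lists the matrices in order A^1, A^2, ..."""
    prod = np.eye(mats[0].shape[0])
    for a in mats:
        prod = a @ prod          # left-multiply to build A^{beta-1} ... A^1
    return bool(np.all(prod[:, col] >= tau))

eps = 0.5
P = np.array([[0.5, 0.5],
              [1.0, 0.0]])       # state 0 is reachable from both states
A = (1 - eps) * np.eye(2) + eps * P

ok = has_common_reachable_state([A, A], col=0, tau=0.4)
# With the identity as transition matrix, state 0 is never reached from state 1.
not_ok = has_common_reachable_state([np.eye(2), np.eye(2)], col=0, tau=0.4)
```

The first case satisfies the check because state 0 can be entered from every state, while the second fails because the chain never leaves its starting state.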

Note that (33) is equivalent to the following Bellman equation on the representative states $\mathcal{S}_R$:

\[
\theta + V^\infty_m(q) = \min_{\Pi(\beta_{m,q})} \Big\{ g(\beta_{m,q}, \Pi(\beta_{m,q}), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j \,|\, \beta_{m,q}, \Pi(\beta_{m,q})] \sum_{m'=S}^{M} V^\infty_{m'}(Q^j_{m'}) \Big\} \quad \forall \beta_{m,q} \in \mathcal{S}_R.
\]

Hence, Lemma 4 basically guarantees that the proposed online learning algorithm will converge to the best-fit parameter vector (per-node potential) satisfying (21). On the other hand, since the ratios of step sizes satisfy $\frac{\epsilon^t_p}{\epsilon^t_v}, \frac{\epsilon^t_d}{\epsilon^t_v} \to 0$ during the LM update (timescale II), the per-node value function will be updated much faster than the Lagrange multipliers. Hence, the update of the Lagrange multipliers in timescale II will trigger another update process of the per-node value function in timescale I. By Corollary 2.1 of [23], we have $\lim_{t \to \infty} \|V^t_m - V^\infty_m(\gamma^t)\| = 0$ w.p.1. Hence, during the LM updates in (29), (30) and (31), the per-node value function update in (28) is seen as almost equilibrated. Moreover, the convergence of the LMs is summarized below.
Lemma 5 (Convergence of the LMs over Timescale II): The iteration on the vector of LMs $\gamma = [\gamma_{S,d}, \gamma_{S,p}, \gamma_{1,p}, \dots, \gamma_{M,p}]^T$ converges almost surely to $\gamma^* = [\gamma^*_{S,d}, \gamma^*_{S,p}, \gamma^*_{1,p}, \dots, \gamma^*_{M,p}]^T$, which satisfies the power and packet drop rate constraints in (14), (15) and (13).

Proof: Please refer to Appendix E. ■ 

Based on the above lemmas, we summarize the convergence performance of the online per-node value function and LM learning algorithm in the following theorem.

Theorem 1 (Convergence of Online Learning Algorithm 1): Under the same conditions as in Lemma 4, we have $(\gamma^t, \mathbf{W}^t) \to (\gamma^*, \mathbf{W}^\infty(\gamma^*))$ w.p.1, where $(\gamma^*, \mathbf{W}^\infty(\gamma^*))$ satisfies $\theta \mathbf{e} + \mathbf{W}^\infty(\gamma^*) = \mathbf{M}^{-1}\mathbf{T}(\gamma^*, \mathbf{M}\mathbf{W}^\infty(\gamma^*))$ and the average power constraints (14), (15) as well as the average packet drop rate constraint (13), where $\mathbf{e}$ is an $(M + 1)(N_Q + 1) \times 1$ vector with all elements equal to 1.

D. Asymptotic Optimality 

Finally, we shall show that the performance of the distributive algorithm is asymptotically globally optimal for high traffic loading.

Theorem 2 (Asymptotically Global Optimal at High Traffic Loading): For sufficiently large $N_Q$ and high traffic loading such that the optimization problem in Problem 1 is feasible, the performance of the proposed distributive control algorithm is asymptotically globally optimal.

Proof: Please refer to Appendix F. ■

V. Simulations and Discussions 

In this section, we shall compare our proposed online per-node value function learning algorithm to five reference baselines. Baselines 1 and 4 refer to the proposed buffered decode-and-forward (BDF) protocol with a throughput-optimal policy (in the stability sense), namely the dynamic backpressure algorithm [24], where we utilize full-duplex RSs in Baseline 1 and half-duplex RSs in Baseline 4. Baselines 2 and 5 refer to the regular decode-and-forward (DF) protocol with CSIT-only scheduling (the link selection and power allocation are adaptive to the CSIT only so as to optimize the end-to-end throughput). We utilize full-duplex RSs in Baseline 2 and half-duplex RSs in Baseline 5. Moreover, Baseline 3 refers to the proposed BDF protocol with CSIT-only scheduling and half-duplex RSs. In the simulations, we assume the total bandwidth is 1 MHz, and the packet arrival at the source node is Poisson with average arrival rate $\lambda_S = 200$ pck/s and deterministic packet size $N_b$ bits. The number of antennas at the source node and the
Fig. 3. Average end-to-end delay versus average transmit SNR. Baseline 1 refers to the dynamic backpressure algorithm with BDF protocol and full-duplex relays. Baseline 2 refers to the CSIT-only scheduling with traditional DF protocol and full-duplex relays. Baseline 3 refers to the CSIT-only scheduling with BDF protocol and half-duplex relays. Baseline 4 refers to the dynamic backpressure algorithm with BDF protocol and half-duplex relays. Baseline 5 refers to the CSIT-only scheduling with traditional DF protocol and half-duplex relays. The deterministic packet size is $N_b = 25$K bits and the number of antennas at each RS is $N_R = 4$. The packet drop rates of Baselines 1-5 and the proposed distributive online learning are 0.2%, 0.2%, 13%, 3%, 24% and 0.2%, respectively.



Fig. 4. Average throughput versus average transmit SNR. The deterministic packet size is $N_b = 30$K bits and the number of antennas at each RS is $N_R = 4$. The packet drop rates of Baselines 1-5 and the proposed distributive online learning are all 10%.



destination node is $N_T = 2$. Moreover, the maximum buffer size of each node (source node and RSs) is $N_Q = 10$.

Figure 3 and Figure 4 illustrate the average end-to-end delay and average throughput versus average transmit SNR per node with $N_R = 4$ antennas at each RS, respectively. It can be observed that the proposed
Fig. 5. Average end-to-end delay versus the number of relays with transmit SNR = 5.5 dB. The deterministic packet size is $N_b = 25$K bits and the number of antennas at each RS is $N_R = 4$. The packet drop rates of Baselines 1-5 and the proposed distributive online learning are 23%, 23%, 86%, 82%, 96% and 0.5%, respectively.

Fig. 6. Average end-to-end delay versus the number of relay antennas with transmit SNR = 5 dB. The deterministic packet size is $N_b = 20$K bits and the number of antennas at each RS is $N_R = 4$. The packet drop rates of Baselines 1-5 and the proposed distributive online learning are 3%, 4%, 9%, 5%, 20% and 0.1%, respectively.



distributive algorithm with half-duplex RS could achieve significant performance gain in both average 
delay and average throughput over all baselines with full-duplex RSs, and even more significant gain over 
the baselines with half-duplex RSs. This illustrates the advantages of the proposed BDF algorithm with 
distributive delay-optimal control policy, which could effectively reduce the intrinsic half-duplex penalty 
in the cooperative communication systems. 
Figure 5 and Figure 6 illustrate the average end-to-end delay versus the number of relays and the number of relay antennas with $N_R = 4$ antennas at each RS, respectively. It can be observed that the average delay of all the schemes decreases as the number of relays or the number of relay antennas increases. Furthermore, the proposed BDF algorithm with distributive delay-optimal control policy has a significant gain in delay over all the baselines.

Figure 7 illustrates the convergence property of the proposed distributive online learning algorithm. We plot the per-node value function of the first relay versus the scheduling slot index at a transmit SNR of 10 dB. The average delay at the 200-th scheduling slot is already very close to the steady-state value, which is much better than all the baselines. Furthermore, unlike the iterations in deterministic NUM problems, the proposed algorithm is online, meaning that normal payload is delivered during the iteration steps.



Average Delay
Proposed: 2.5
Baseline 1 (Backpressure, FD, BDF): 4.7
Baseline 2 (CSIT, FD, DF): 4.9
Baseline 3 (CSIT, HD, BDF): 32
Baseline 4 (Backpressure, HD, BDF): 30
Baseline 5 (CSIT, HD, DF): 43
Fig. 7. Illustration of the convergence of the proposed online learning algorithm. The instantaneous per-node value function is plotted versus time slot index for a cooperative MIMO system with a source node (with 2 antennas) and 2 RS nodes (each with 4 antennas). The transmit SNR of the source and the RS nodes is 10 dB and the target packet drop rate is 0.2%. Unlike the iterations in deterministic NUM problems, the proposed algorithm is online, meaning that normal payload is delivered during the iteration steps.



Average Delay
Proposed: 2.6
Baseline 1 (Backpressure, FD, BDF): 4.7
Baseline 2 (CSIT, FD, DF): 4.9
Baseline 3 (CSIT, HD, BDF): 32
Baseline 4 (Backpressure, HD, BDF): 30
Baseline 5 (CSIT, HD, DF): 43




VI. Summary 

In this paper, we consider queue-aware resource control for two-hop cooperative MIMO systems. We show that by exploiting buffering in each MIMO relay, we can substantially reduce the intrinsic half-duplex loss in cooperative systems. The delay-optimal resource control policy is formulated as an average-cost infinite horizon Markov Decision Process (MDP). To obtain a low complexity solution, we approximate the value function by a linear combination of per-node value functions. The per-node value functions are obtained using a distributive stochastic learning algorithm. We also establish technical conditions for almost-sure convergence and show that, in the heavy traffic limit, the proposed low complexity distributive algorithm converges to the globally optimal solution.

Appendix A: Proof of Lemma 1

The average number of bits received by the source node is given by $\lambda_S(1 - D)$, which is also the average number of information bits received by the relay cluster, since the source node and the relay cluster are in cascade. Let $\overline{W}$, $\overline{W}_S$ and $\overline{W}_R$ be the average time (in units of frames) that one information bit stays in the system, in the source node's queue and in some relay's queue, respectively, and let $\overline{N}_S$ and $\overline{N}_R$ be the average number of information bits in the source node's queue and in the relays' queues, respectively. We have $\overline{N}_S = (1 - D)\lambda_S \overline{W}_S$ and $\overline{N}_R = (1 - D)\lambda_S \overline{W}_R$ by Little's Law. Noticing that $\overline{W} = \overline{W}_S + \overline{W}_R$, we have

\[
\overline{W} = \frac{\overline{N}_S + \overline{N}_R}{\lambda_S(1 - D)}.
\]

Since the change of the system queue state forms a Markov chain, we have

\[
\overline{W} = \frac{\sum_k \pi_k \sum_{m=S}^{M} Q^k_m}{\lambda_S(1 - D)},
\]

where $\pi_k$ is the steady-state distribution. For a sufficiently small packet drop rate requirement ($1 - D \approx 1$), the end-to-end average delay becomes $\overline{W} \approx \frac{1}{\lambda_S} \sum_k \pi_k \sum_{m=S}^{M} Q^k_m$.
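The Little's Law argument can be made concrete with a small numerical sketch. The steady-state distribution, backlog values and rates below are invented for illustration, and `average_delay` is a hypothetical helper name. It evaluates the average delay as the mean total backlog divided by the accepted arrival rate $\lambda_S(1 - D)$.

```python
def average_delay(pi, total_queue, arrival_rate, drop_rate):
    """Average end-to-end delay via Little's Law.

    pi[k]          : steady-state probability of system queue state k
    total_queue[k] : total backlog (source + relays) in state k
    arrival_rate   : average arrivals per frame at the source (lambda_S)
    drop_rate      : average packet drop rate D
    """
    mean_backlog = sum(p * q for p, q in zip(pi, total_queue))
    return mean_backlog / (arrival_rate * (1.0 - drop_rate))

# Hypothetical three-state example.
pi = [0.5, 0.3, 0.2]
total_queue = [0, 2, 5]
delay_no_drop = average_delay(pi, total_queue, arrival_rate=0.8, drop_rate=0.0)
delay_with_drop = average_delay(pi, total_queue, arrival_rate=0.8, drop_rate=0.2)
```

With a mean backlog of 1.6 and 0.8 arrivals per frame, the delay is 2 frames when nothing is dropped, and it is larger for the same backlog when fewer bits are admitted.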

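Appendix B: Proof of Lemma 2

From the Bellman equation of the original state space (18), we have

\begin{align}
\theta + V(\mathbf{Q}^i, \mathbf{H}) &= \min_{\Pi(\mathbf{Q}^i, \mathbf{H})} \Big\{ g\big((\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H}), \gamma\big) + \sum_{(\mathbf{Q}^j, \mathbf{H}')} \Pr\big[(\mathbf{Q}^j, \mathbf{H}') \,\big|\, (\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H})\big] V(\mathbf{Q}^j, \mathbf{H}') \Big\} \nonumber \\
&\overset{(a)}{=} \min_{\Pi(\mathbf{Q}^i, \mathbf{H})} \Big\{ g\big((\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H}), \gamma\big) + \sum_{\mathbf{Q}^j} \Pr\big[\mathbf{Q}^j \,\big|\, (\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H})\big] V(\mathbf{Q}^j) \Big\}, \tag{34}
\end{align}

where (a) is due to the definition $V(\mathbf{Q}^j) = \mathbb{E}_{\mathbf{H}'}[V(\mathbf{Q}^j, \mathbf{H}') \,|\, \mathbf{Q}^j]$, and the optimal control actions are given by $\Pi^*(\mathbf{Q}^i, \mathbf{H}) = \arg\min_{\Pi(\mathbf{Q}^i, \mathbf{H})} \big\{ g((\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H}), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j \,|\, (\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H})] V(\mathbf{Q}^j) \big\}$. Thus, by the partitioning of the optimal control actions in Definition 1, i.e., $\Pi^*(\mathbf{Q}^i) = \{\Pi^*(\mathbf{Q}^i, \mathbf{H}) \,|\, \forall \mathbf{H}\}$,

\[
\Pi^*(\mathbf{Q}^i) = \arg\min_{\Pi(\mathbf{Q}^i)} \sum_{\mathbf{H}} \Pr(\mathbf{H}) \Big\{ g\big((\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H}), \gamma\big) + \sum_{\mathbf{Q}^j} \Pr\big[\mathbf{Q}^j \,\big|\, (\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H})\big] V(\mathbf{Q}^j) \Big\} \tag{35}
\]

From (34) and (35), we have $\theta + \sum_{\mathbf{H}} \Pr(\mathbf{H}) V(\mathbf{Q}^i, \mathbf{H}) = \min_{\Pi(\mathbf{Q}^i)} \sum_{\mathbf{H}} \Pr(\mathbf{H}) \big\{ g((\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H}), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j \,|\, (\mathbf{Q}^i, \mathbf{H}), \Pi(\mathbf{Q}^i, \mathbf{H})] V(\mathbf{Q}^j) \big\} \overset{(b)}{=} \min_{\Pi(\mathbf{Q}^i)} \big[ g(\mathbf{Q}^i, \Pi(\mathbf{Q}^i), \gamma) + \sum_{\mathbf{Q}^j} \Pr[\mathbf{Q}^j \,|\, \mathbf{Q}^i, \Pi(\mathbf{Q}^i)] V(\mathbf{Q}^j) \big]$, where the equality (b) is due to the definition of $g$ in (19). As a result, the control policy obtained by solving (18) is the same as that obtained by solving (17), and this completes the proof.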
Appendix C: Proof of Lemma 3

We shall prove the general control policy first, followed by the closed-form power control derivation. According to (22), given $N_{SR}$ and $N_{RD}$, the optimal power control is given by:



Therefore, $p^*_{S,m}(N_{SR}) = \arg\min_{p_{S,m}} G_{S,m}(N_{SR}, p_{S,m})$ and $p^*_{n,D}(N_{RD}) = \arg\min_{p_{n,D}} G_{n,D}(N_{RD}, p_{n,D})$. To determine the optimal Rx-RS, Tx-RS and stream allocation, the bidding is divided into two stages:

• First bidding: Each RS (say the $m$-th RS) broadcasts one bid for each possible $N_{SR}$, indicating the value of $A^*_m(N_{SR}) = G_{S,m}(N_{SR}, p^*_{S,m}(N_{SR}))$ that would result if it were selected as the Rx-RS with $N_{SR}$ S-R streams.

• Second bidding: After receiving the bids in the first round, each RS (say the $n$-th RS) calculates, assuming it is selected as the Tx-RS, which other RS is the best Rx-RS (say the $m$-th RS), what the best $N_{SR}$ and $N_{RD}$ are, and what the corresponding $B^*_n = G_{S,m}(N_{SR}, p^*_{S,m}) + G_{n,D}(N_{RD}, p^*_{n,D})$ is. It then broadcasts the result $B^*_n$ as the second bid.

• After comparing the $B^*_n$, the optimal Rx-RS, Tx-RS and stream allocation can be determined.

Therefore, the first-stage bidding and the second-stage bidding are straightforward.

When As and (m = S, 1, 2, M) are sufficiently large, it with large probability that Of- (m = 
S, 1, 2, ..., M) is sufficiently large. Hence, following a similar approach to [23], it can be proved that the value function $V_m$ ($m = S, 1, 2, \dots, M$) increases polynomially in $\mathbf{Q} = [Q_S, Q_1, \dots, Q_M]^T$. The optimization on $p_{S,m}$ is given by



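\begin{align}
&\min \mathbb{E}_{\mathbf{H}} \Big[ \sum_{m, N_{SR}} I^{N_{SR}}_{S,m} G_{S,m}(N_{SR}, p_{S,m}) + \sum_{n, N_{RD}} I^{N_{RD}}_{n,D} G_{n,D}(N_{RD}, p_{n,D}) \Big] \nonumber \\
&= \mathbb{E}_{\mathbf{H}} \Big[ \sum_{m, N_{SR}} I^{N_{SR}}_{S,m} \min_{p_{S,m}} G_{S,m}(N_{SR}, p_{S,m}) + \sum_{n, N_{RD}} I^{N_{RD}}_{n,D} \min_{p_{n,D}} G_{n,D}(N_{RD}, p_{n,D}) \Big] \nonumber \\
&\Rightarrow \min_{p_{S,m}} G_{S,m}(N_{SR}, p_{S,m}). \tag{36}
\end{align}

Similar to [25], we can perform a Taylor expansion as follows: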




(37) 




(38) 



October 5, 2010 






where $V'_S$ is the first-order derivative of $V_S$ and the higher-order terms are negligible. The same approach can be used to expand $V_m(Q^i_m + R_{S,m}(N_{SR}, p_{S,m}))$ as $V_m(Q^i_m + R_{S,m}(N_{SR}, p_{S,m})) = V_m(Q^i_m) + R_{S,m}(N_{SR}, p_{S,m}) V'_m(Q^i_m)$.

In the high SNR region, we have

dR S ,rn( N SR,PS, m ) _ N 1 



dPs,m ln2 Wjm + £fjf i ' 



(39) 



According to (37), (38) and (39), taking the derivative of the RHS of (36) and setting it to zero, we obtain the closed-form expression for the power allocation in (26). Moreover, (27) can be proved in the same way. Finally, when $Q_m$ and $Q_S$ are sufficiently large, according to the definition of the derivative, we have

\[
V'_S(Q_S) = \frac{V_S(Q_S + 1) - V_S(Q_S - 1)}{2}, \quad V'_m(Q_m) = \frac{V_m(Q_m + 1) - V_m(Q_m - 1)}{2}.
\]

Appendix D: Proof of Lemma 4

From Eol . the convergence property of the asynchronous update and synchronous update is the same. 
Therefore, we consider the convergence of related synchronous version without loss of generality. 

Let c G R be a constant, we have Tj(cVg) = cTj(Vg), where Tj is one element of mapping T 
corresponding to the state with all buffers empty. Similar to [27], the per-node value function $\{V_m\}$ is bounded almost surely during the iterations of the algorithm. According to the construction of the parameter vector $\mathbf{W}$, the update on $V_m$ is equivalent to the update on $\mathbf{W}$, and proving the convergence in Lemma 4 is equivalent to proving the convergence of the update on $\mathbf{W}$. In the following, we first introduce the following lemma on the convergence of the learning noise.

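Lemma 6: Define $\mathbf{q}^l = \mathbf{M}^{\dagger}\big[\mathbf{g}(\Pi^l) + \mathbf{P}(\Pi^l)\mathbf{M}\mathbf{W}^l - \mathbf{M}\mathbf{W}^l - T_I(\mathbf{M}\mathbf{W}^l)\mathbf{e}\big]$. When the number of iterations $l \ge j \to \infty$, the update procedure can be written as follows with probability 1: $\mathbf{W}^{l+1} = \mathbf{W}^j + \sum_{i=j}^{l} \epsilon^i_v \mathbf{q}^i$.

The proof of the above lemma follows the standard approach of stochastic approximation with martingale noise [22]. Moreover, the following lemma is about the limit of the sequence $\{\mathbf{q}^l\}$.

Lemma 7: Suppose the following two inequalities are true for $l = a, a + 1, \dots, a + b$:

\begin{align}
\mathbf{g}(\Pi^l) + \mathbf{P}(\Pi^l)\mathbf{M}\mathbf{W}^l &\le \mathbf{g}(\Pi^{l-1}) + \mathbf{P}(\Pi^{l-1})\mathbf{M}\mathbf{W}^l \tag{40} \\
\mathbf{g}(\Pi^{l-1}) + \mathbf{P}(\Pi^{l-1})\mathbf{M}\mathbf{W}^{l-1} &\le \mathbf{g}(\Pi^l) + \mathbf{P}(\Pi^l)\mathbf{M}\mathbf{W}^{l-1}, \tag{41}
\end{align}

then we have

\[
|q^{a+b}_i| \le C_1 \prod_{k=0}^{\lfloor b/\beta \rfloor - 1} (1 - \tau_{a + k\beta}) \quad \forall i, \tag{42}
\]

where $q^{a+b}_i$ denotes the $i$-th element of the vector $\mathbf{q}^{a+b}$ and $C_1$ is some constant.

Proof: From (40) and (41), we have

\begin{align*}
\mathbf{q}^l &= \mathbf{M}^{\dagger}\big[\mathbf{g}(\Pi^l) + \mathbf{P}(\Pi^l)\mathbf{M}\mathbf{W}^l - \mathbf{M}\mathbf{W}^l - w_l \mathbf{e}\big] \le \mathbf{M}^{\dagger}\big[\mathbf{g}(\Pi^{l-1}) + \mathbf{P}(\Pi^{l-1})\mathbf{M}\mathbf{W}^l - \mathbf{M}\mathbf{W}^l - w_l \mathbf{e}\big] \\
\mathbf{q}^{l-1} &= \mathbf{M}^{\dagger}\big[\mathbf{g}(\Pi^{l-1}) + \mathbf{P}(\Pi^{l-1})\mathbf{M}\mathbf{W}^{l-1} - \mathbf{M}\mathbf{W}^{l-1} - w_{l-1} \mathbf{e}\big] \le \mathbf{M}^{\dagger}\big[\mathbf{g}(\Pi^l) + \mathbf{P}(\Pi^l)\mathbf{M}\mathbf{W}^{l-1} - \mathbf{M}\mathbf{W}^{l-1} - w_{l-1} \mathbf{e}\big]
\end{align*}

where $w_l = T_I(\mathbf{M}\mathbf{W}^l)$. According to Lemma 6, we have $\mathbf{W}^l = \mathbf{W}^{l-1} + \epsilon^{l-1}_v \mathbf{q}^{l-1}$. Therefore,

\begin{align*}
\mathbf{q}^l &\le \big[(1 - \epsilon^{l-1}_v)\mathbf{I} + \mathbf{M}^{\dagger}\mathbf{P}(\Pi^{l-1})\mathbf{M}\epsilon^{l-1}_v\big]\mathbf{q}^{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e} = \mathbf{B}^{l-1}\mathbf{q}^{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e} \\
\mathbf{q}^l &\ge \big[(1 - \epsilon^{l-1}_v)\mathbf{I} + \mathbf{M}^{\dagger}\mathbf{P}(\Pi^l)\mathbf{M}\epsilon^{l-1}_v\big]\mathbf{q}^{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e} = \mathbf{A}^{l-1}\mathbf{q}^{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e}.
\end{align*}

Noticing that $\mathbf{A}^{l-1}\mathbf{e} = \mathbf{B}^{l-1}\mathbf{e}$, we have

\begin{align*}
&\mathbf{A}^{l-1} \cdots \mathbf{A}^{l-\beta}\mathbf{q}^{l-\beta} - C_1\mathbf{e} \le \mathbf{q}^l \le \mathbf{B}^{l-1} \cdots \mathbf{B}^{l-\beta}\mathbf{q}^{l-\beta} - C_1\mathbf{e} \\
&\Rightarrow (1 - \tau_l)\min \mathbf{q}^{l-\beta} \le \mathbf{q}^l + C_1\mathbf{e} \le (1 - \tau_l)\max \mathbf{q}^{l-\beta} \\
&\Rightarrow \max \mathbf{q}^l + C_1 \le (1 - \tau_l)\max \mathbf{q}^{l-\beta}, \quad \min \mathbf{q}^l + C_1 \ge (1 - \tau_l)\min \mathbf{q}^{l-\beta} \\
&\Rightarrow \max \mathbf{q}^l - \min \mathbf{q}^l \le (1 - \tau_l)\big[\max \mathbf{q}^{l-\beta} - \min \mathbf{q}^{l-\beta}\big] \Rightarrow |q^l_i| \le \max \mathbf{q}^l - \min \mathbf{q}^l \le C_2 (1 - \tau_l) \quad \forall i,
\end{align*}

where the first step is due to the conditions of Lemma 4 on the matrix sequences $\{\mathbf{A}^l\}$ and $\{\mathbf{B}^l\}$, $\max \mathbf{q}^l$ and $\min \mathbf{q}^l$ denote the maximum and minimum elements of $\mathbf{q}^l$, respectively, $C_1$ and $C_2$ are constants, and the first inequality of the last step holds because $\min \mathbf{q}^l \le 0$. This completes the proof of Lemma 7. ■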

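Therefore, the proof of Lemma 4 can be divided into the following steps: (1) From the property of the sequence $\{\epsilon^l_v\}$, we have $\prod_{i=0}^{l-j-1} (1 - \epsilon^i_v) \to 0$ ($l \to \infty$). (2) According to the first step, noting that $\tau_l = O(\epsilon^l_v)$, from (42) we have $\mathbf{q}^l \to 0$ ($l \to \infty$). (3) Therefore, the update on $\{\mathbf{W}^l\}$ will converge, and the fixed point of the convergence $\mathbf{W}^\infty$ satisfies $T_I(\mathbf{M}\mathbf{W}^\infty)\mathbf{e} + \mathbf{W}^\infty = \mathbf{M}^{\dagger}\mathbf{T}(\mathbf{M}\mathbf{W}^\infty)$.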

Appendix E: Proof of Lemma 5

Due to the page limit, we only provide a sketch of the proof. The convergence proof of the LMs $\{\gamma_{S,p}, \gamma_{1,p}, \dots, \gamma_{M,p}\}$ for a given $\gamma_{S,d}$ is as follows:

• For notational convenience, we first define the average transmit power of each node as follows: $\overline{P}_S(\gamma) = \mathbb{E}^{\Pi(\gamma)}\big[\sum_{m=1}^{M} \sum_{i=1}^{\min(N_T, N_R)} I^i_{S,m}\, p^i_{S,m}\big]$ and $\overline{P}_m(\gamma) = \mathbb{E}^{\Pi(\gamma)}\big[\sum_{i=1}^{\min(N_T, N_R)} I^i_{m,D}\, p^i_{m,D}\big]$ ($m = 1, 2, \dots, M$), where $\mathbb{E}^{\Pi(\gamma)}[\cdot]$ denotes the expectation w.r.t. the policy $\Pi(\gamma)$. Using standard stochastic approximation theory, the dynamics of the LM update equations for $\{\gamma_{S,p}, \gamma_{1,p}, \dots, \gamma_{M,p}\}$ can be represented by the following ODE:

\[
[\dot{\gamma}_{S,p}(t), \dots, \dot{\gamma}_{M,p}(t)]^T = \big[\overline{P}_S(\gamma) - P_S, \overline{P}_1(\gamma) - P_R, \dots, \overline{P}_M(\gamma) - P_R\big]^T. \tag{43}
\]

• Using perturbation analysis in [28], we have < (m = S, 1, 2, M) and 

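(m = S, 1, 2, ..., M, n ≠ m). Thus, the update of $\gamma_{m,p}$ ($m = S, 1, \dots, M$) in ODE (43) will drive $\overline{P}_m - P_R$ (or $\overline{P}_S - P_S$) to 0 whenever $\overline{P}_m - P_R$ (or $\overline{P}_S - P_S$) is non-zero. Therefore, the ODE (43) will converge. The converged LMs $\{\gamma^*_{S,p}(\gamma_{S,d}), \gamma^*_{1,p}(\gamma_{S,d}), \dots, \gamma^*_{M,p}(\gamma_{S,d})\}$ can be characterized by the equilibrium point of the ODE (43), which is given by the RHS of (43) $\to 0$.

Suppose that, for a given $\gamma_{S,d}$, $\{\gamma_{S,p}, \gamma_{1,p}, \dots, \gamma_{M,p}\}$ converge to $\{\gamma^*_{S,p}(\gamma_{S,d}), \gamma^*_{1,p}(\gamma_{S,d}), \dots, \gamma^*_{M,p}(\gamma_{S,d})\}$. Since $\frac{\partial\, \mathbb{E}^{\Pi(\gamma)}\left[I[Q_S = N_Q]\right]}{\partial \gamma_{S,d}} < 0$, the update on $\gamma_{S,d}$ will converge as well, for a similar reason as in the convergence of $\{\gamma_{S,p}, \gamma_{1,p}, \dots, \gamma_{M,p}\}$. Similarly, the converged $\gamma^*_{S,d}$ can be characterized by the equilibrium point of the ODE $\dot{\gamma}_{S,d}(t) = \mathbb{E}^{\Pi(\gamma)}\big[I[Q_S = N_Q]\big] - D$, which is given by the RHS $\to 0$.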

Appendix F: Proof of Theorem 2

Without loss of generality, we shall consider the approximate value function $V(\mathbf{Q}) = \sum_{m=S}^{M} \sum_{q=1}^{N_Q} V_m(q) I[Q_m = q]$ on the following redefined set of representative states $\mathcal{S}_R = \{\delta_{m,q} \,|\, m = S, 1, 2, \dots, M;\ q = 0, 1, \dots, q_I - 1, q_I + 1, \dots, N_Q\}$, where the state $\delta_{m,q}$ is given by $\delta_{m,q} = [Q_S = q_I, Q_1 = q_I, \dots, Q_m = q, \dots, Q_M = q_I]^T$ and $q_I < N_Q$ is sufficiently large. Correspondingly, $\mathbf{M}^{-1}$ should also be redefined such that the per-node value function $\{V_m\}$ is updated on the representative states $\mathcal{S}_R$ [21].

First of all, following a similar approach to the proof of Lemma 4, the per-node value function (under the new reference states) would also converge almost surely to $\{V^\infty_m(\gamma)\}$ for any given LMs $\gamma$.

Next, when the conditions of Theorem 2 are satisfied, given any $\epsilon > 0$, there is one integer $Q_0(\epsilon)$ such that for all $q > Q_0(\epsilon)$ and $q_I = Q_0(\epsilon)$, we have (from the proof of Lemma 3):

\[
V^\infty_m(q - r) - V^\infty_m(q) = V^\infty_m(q_I - r) - V^\infty_m(q_I) + O(\epsilon). \tag{44}
\]

Moreover, since {V^(q)} are all monotonically increasing functions with respect to q and {V^(Nq)} 



dVrrJjX 



>> 
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are all boundeco we have V m yQo(e)) = 0(e) for sufficiently large arrivals. Therefore, (l44l holds for 
all q € [0, Nq] for sufficiently large Nq and input arrivals. Similarly, we have 



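\begin{align}
V^\infty_S(q + n - r) - V^\infty_S(q + n) &= V^\infty_S(q_I + n - r) - V^\infty_S(q_I + n) + O(\epsilon) \tag{45} \\
V^\infty_m(q + r) - V^\infty_m(q) &= V^\infty_m(q_I + r) - V^\infty_m(q_I) + O(\epsilon). \tag{46}
\end{align}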



Hence, with the above equations and substituting the converged per-node value function $\{V^\infty_m(\gamma)\}$ into (18) for the reference states, we get



V$°(q) =q + is A* = Nq] + Y1 fx(n) (^°(« + n) - V£° (n)) + minE H { £ 

n 

+ f x (n) (V$°(q + n - rj«) - V§°(q + n)) + Kffe + rj: 



'/5,m 



fc 



CO?) =9 + (9) + minE H { £ ky, + 



JVfi 



(47) 
(48) 



where $m = 1, 2, \dots, M$.

Finally, for any system state Q* = [Q l s , ...,Q l M ] T , substitute the above equations into the RHS of the 
original Bellman equation in CES>, we get RHS of<EQ = E™=s Qln+lsAQs = N q]+Z u fx(n)V^(Q i s + 



n) + EJL VmiQU) + mm n(Q< ) E H { E m ,iv SH [ls, P P N s ^ + En /*(») (v£°(<& + * 



_ Nrb' 
im ' m,D ■ 



-V c 



'(Q l m )}}+ 



O(e) = E™=s OQm) + En + 0(e) = 7 (Q l ) + En + O(e), where equality 

(a) is due to (46), and equality (b) is due to (47) and (48). Since $\sum_n f_x(n) V^\infty_S(n)$ is a constant independent of $\mathbf{Q}^i$ and $\epsilon$ is chosen arbitrarily, we have shown that the approximate value function $V(\mathbf{Q}) = \sum_{m=S}^{M} \sum_{q=1}^{N_Q} V^\infty_m(q) I[Q_m = q]$ can satisfy the original Bellman equation (18) asymptotically (when $N_Q \to +\infty$). As a result, the proposed distributive update algorithm converges to the globally optimal solution, and this completes the proof.